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ABSTRACT 



In this thesis I explore challenging discrete energy minimization problems 
that arise mainly in the context of computer vision tasks. This work mo- 
tivates the use of such "hard-to-optimize" non-submodular functionals, and 
proposes methods and algorithms to cope with the NP-hardness of their op- 
timization. Consequently, this thesis revolves around two axes: applications 
and approximations. The applications axis motivates the use of such "hard- 
to-optimize" energies by introducing new tasks. As the energies become less 
constrained and structured one gains more expressive power for the objec- 
tive function achieving more accurate models. Results show how challenging, 
hard-to-optimize, energies are more adequate for certain computer vision ap- 
plications. To overcome the resulting challenging optimization tasks the sec- 
ond axis of this thesis proposes approximation algorithms to cope with the 
NP-hardness of the optimization. Experiments show that these new methods 
yield good results for representative challenging problems. 
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In this thesis I explore challenging discrete energy minimization problems 
that arise mainly in the context of computer vision. From binary energies 
for figure-ground segmentations through multi-label semantic segmentation, 
stereo, denoising, to inpainting and image editing (e.g., Szeliski et al (2008); 



Pritch et al (2009); Bagon (2012)). These energies usually involve thousands 



of variables and dozens of discrete states. Moreover, most of the energies in 
this domain are pair- wise energies, that is, they only involve interactions 
between pairs of neighboring variables. 

The optimization of these discrete energies is known to be NP-hard in 
most cases (Boykov et al ( 2001[ )). Still, despite this theoretical hardness, 
instances of these energies that have special properties may give rise to poly- 
nomial time global optimal algorithms. Other instances with slightly different 
properties allow, in practice, for good and efficient approximation schemes. 

The next chapter (Chap.[T]) reviews previous work related to discrete pair- 
wise energy functions and their optimization. It outlines the properties and 
conditions under which global optimization is feasible, and the conditions re- 
quired for successful practical approximations. Chapter [T] also surveys several 
key approximation algorithms. It provides a brief outline of the properties 
of the energy that must be met in order for each algorithm to succeed. The 
conclusion of this survey is that discrete pair-wise energies may be broadly 
classify into two categories: 

smoothness-encouraging energies: energies that favor configurations with 
neighboring variables taking the same discrete state. 

contrast-enhancing energies: energies that encourage solutions where neigh- 
boring variables take different states. 

So far the energies mainly used in computer vision tasks are of the first 
category: smoothness-encouraging (see e.g., |Szeliski et at] fl2008D ). These 
smoothness-encouraging energies allow for efficient approximation schemes. 
On the other hand, contrast-enhancing energies are far more challenging 
when it comes to optimization, and are indeed less popular in practice. 
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In this thesis I would like to step outside of this "comfort-zone" of the 
smoothness-encouraging energies and explore more challenging discrete en- 
ergies. This work revolves around two axes: 

1. Applications: The first motivates the use of such a hard-to-optimize" 
functionals by introducing new applications. As the energies become 
less constrained and structured one gains more expressive power for the 
objective function achieving more accurate models. Results show how 
contrast-enhancing, hard-to-optimize, functionals are more adequate 
for certain computer vision tasks. 

2. Approximations: To overcome the resulting challenging optimization 
tasks the second axis of this thesis proposes methods and algorithms 
to cope with the NP-hardness of this optimization. Experiments show 
that these new methods yield good results for representative challenging 
problems. 



1. DISCRETE PAIR- WISE ENERGIES A REVIEW 



Discrete energy minimization is a ubiquitous task in computer vision. From 
binary energies for figure-ground segmentations through multi-label semantic 



segmentation, stereo, denoising, to inpainting and image editing (e.g., Szeliski 



et al (2008); Pritch et al (2009); Bagon (2012)). In my thesis I focus on 



various types of minimization problems of pair- wise energies as they arise in 
different computer vision applications. These discrete optimization problems 
are, in general, NP-hard. Yet, there are cases in which the minimization of 
a pair- wise energy can be solved exactly in polynomial time. In this intro- 
ductory chapter I survey different properties of discrete pair- wise energies. I 
show how these properties of the energies relate to the inherent difficulty of 
the optimization task. Some properties entail exact optimization algorithms, 
while other properties admit efficient approximations. The most important 
property is "smoothness-encouraging" : an energy that prefers the labels of 
neighboring variables to be the same. For these "smoothness-encouraging" 
energies there exist efficient approximate minimization algorithms. In con- 
trast, energies that encourage neighboring variables to have different labels 
are much more challenging to minimize. For these "contrast-enhancing" en- 
ergies existing algorithms provide poor approximations. This thesis focuses 
on the optimization of these challenging "contrast-enhancing" energies. 



1.1 Pair-wise Energy Function 

Before diving into the minimization task, this section presents the discrete 
pair- wise energy function and the notations that are used in this thesis. It 
also provides some insights and motivation for the use of such energies. 
A discrete pair- wise energy is a functional of the form 

F ( x ) = ^ x ^ + & 

ije£ 1 

It is defined over n variables {x^ i — 1, . . . , n), each taking one of I discrete 
labels {xi G {1, . . . ,/}), where S represents a set of neighboring variables. 
The term cpi [xi) is a unary term reflecting the compatibility of label Xi to 
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variable i (also known as a "data term"). The pair- wise term, ip^ (x^Xj) 
reflects the interaction between labels Xi and x j assi gned to variables i and 
j respectively. 

If we were to discard the pair- wise term, minimizing energy (1.1) is sim- 
ply choosing the label that best fit each variable separately. However, the 
presence of the pair- wise term introduces dependencies between the different 
variables and turns the optimization into a much more complicated process. 
Despite the local nature of the pair-wise term - binding the values of only 
neighboring variables introduces global effects on the overall optimization. 
Propagating the information from the local pair- wise terms to form a global 
solution is a major challenge for the optimization of Eq. (1.1). 

The energy function of Eq. (1.1) has an underlying structure defined by 
the choice of interacting neighbors. It is common to associate with F (x) 
a graph Q = (V,£), where the set of nodes V = {1, . . . ,n} represents the 
variables, and the edges 8 represents the interacting neighboring pairs. 

The following example shows how such discrete functional may arise in a 
well studied computer vision application. This example also illustrates the 
relation between discrete optimization and inference in graphical models. 



Example: Stereo reconstruction via MRF representation 

Given a rectified stereo pair of images, the goal is to find the disparity 
of each pixel in the reference image. The true disparity of each pixel is a 
random variable denoted by X{ for the pixel at location i. Each variable 
can take one of L discrete states, which represent the possible disparities 
at that point. For each possible disparity value, there is a cost associated 
with matching the pixel in the reference image to the corresponding pixel 
in the other image at that disparity value. Typically, this cost is based on 
the intensity differences between the two pixels, y i) which is an observed 
quantity. We denote this cost by It relates how compatible a 

disparity value Xi is with the observed intensity difference ^. A second 
function expresses the disparity compatibility between neigh- 

boring pixels. This function usually expresses the prior assumption that 
the disparity field should be smooth. Examples of such prior that are 
commonly used are the Potts model: 

the £i similarity: 

^{x h Xj) oc e -1 *'-^ 1 (1.3) 
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or its robust (truncated) version: 

y(x h Xj) oce- min{ ^-^'l' T} (1.4) 

With the two functions $ and ^ the joint probability for an assignment 
of disparities to pixels can be written as: 

P(x, y)oc H x j) II ( L5 ) 

where x is an assignment of a disparity value for each pixel i. X{ is the 
hidden variable (disparity) at location % and yi is the observed variable 
(intensity difference) at location i. ij G £ represent a pair of neighboring 
nodes and is usually taken as a regular 4-connected grid over the image 
domain. 

The resulting graphical model is known as a pairwise Markov Random 
Field. Although the compatibility functions only consider adjacent vari- 
ables, each variable is still able to influence every other variable in the 
field via these pairwise connections. 

We look for a disparity assignment such that the labeling x* maximizes 
the joint probability, i.e., 

x* = argmax P(x, y) (1-6) 

X 

Assuming uniform prior over all configurations x, the Maximum A Pos- 
teriori (MAP) estimator is equivalent to 

x* = argmax P(x|y) (1.7) 

X 

Maximizing the posterior probability is equivalent to minimizing an en- 
ergy functional of the form (1.1). Taking — log of (1.5) yields the following 
function 

F (x) = ^ ~ log ttfo, + ~ lo § Vi) ( L8 ) 
This equation can be expressed as 

^2 Vij ( X ii X j) +yZ i Pi ( X i) ( L9 ) 

where ^ (x u Xj) = - log V(x u xj) and cpi (xi) = - log $(x u yi). 
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Thus MAP (maximum a-posteriori) inference in graphical models such as 
MRF and conditional random fields (CRF) boils down to the optimization 
of a discrete pair- wise energy functional of the form Eq. (1.1) which is 
the focus of this thesis. 



Energy functions of the form Eq. (1.1) arise in many graphical models 
(MRFs and CRFs) (see e.g., Blake et al. (2011) and references therein). 



However, it is not restricted to that domain and are also encountered in 
a variety of other domains such as structural learning 



(e.g. 



Nowozin and 



Lampert ( 2011[ )), and as the inference part of discriminative models (e.g., 
structural SVM |Taskar et a~l] fl2003P ; |Tsochantaridis et al] Q2006D ). This 
thesis explores some challenging instances of these energies and explores new 
methods for improved minimization approaches for these hard-to-optimize 



energies. 



1.2 Minimization of Discrete Pair-wise Energies 



The energy function of Eq. (1.1) presented in the previous section is used in 



many applications to evaluate how compatible is a certain discrete solution 
x to a given problem. It is now desired to find the best solution for the given 
problem by finding a solution x* with the lowest energy. That is, solving the 
optimization problem 



x* arg min F (x) 



(1.10) 



over all discrete solutions x. 



In general, the optimization problem (1.10) is NP-hard. Known hard 



combinatorial problems such as max-cut, multi-way cut and many others 
may be formulated in the form of (1.10) (see e.g., Boykov et al. ( 2001[ )). Yet, 
there are special instances of problem (1.10) which can be optimized exactly 
in polynomial time. The main two factors that affect the difficulty of the 



optimization problem (1.10) are: 



1. The underlying graph structure, 8. 

2. Mathematical properties of the pair- wise interactions, 

When the underlying graph 8 has no cycles (that is, 8 has a tree struc- 



ture) optimization of (1.10) is fairly straight-forward: By propagating infor- 



mation back and forth from the leafs to the root convergence to global mini- 
mum is attained. This information propagation is often referred to as belief- 



propagation (BP) ( |Pearl (1988); Koller and Friedman (2009)). A cycle- free 
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graph is crucial for this rapid polynomial time convergence of this message- 
passing scheme: Every path from root to leaf is unique and therefore consen- 
sus along path is attained in a single forward backward pass. However, when 
8 has cycles, paths between the different variables are no longer unique and 
may give rise to contradicting messages. These contradictions introduce an 
inherent difficulty to the optimization process making it NP-hard in general. 

Despite the inherent difficulty of optimizing (1.10) over a cyclic graph £, 
there are instances of problem (1.10) that can still be efficiently minimized 
when the pair- wise terms, meet certain conditions. The next few sec- 
tions explore these conditions in more detail, providing pointers to existing 
optimization methods that succeed in exploiting special structures of tpij to 
suggest methods and guarantees on the optimization process. For simplicity, 
we start in Sec. |1.3| with the special case of discrete binary variables, that is 

x e {0, i} r 



Then we move on, in Sec. 



1.4 



to the multilabel case of discrete 
variables taking one of multiple possible states, x G {1, . . . , 



1 . 3 Binary pro blems 



Binary optimization problems are of the form (1.10) where the solution space 
is restricted to binary vectors only, i.e., x G {0, l} n . 

The most basic property of binary functionals is submodularity. This 
property is defined as follows: 

Definition 1 (Binary submodular). A pair- wise function F (x) defined for 
binary vectors x is submodular iffMi, j G 8: Lpij (0, 0)+(fij (1, 1) < cpij (1, 0) + 
Vij (0, 1) 

Close inspection of Def. [I] reveals that a submodular function assigns lower 
energy to smooth configurations (i.e. to Xi = xj) than to "contrastive" con- 
figurations (i.e., to Xi Xj). That is the "smooth" state (pij (0, 0) + tfij (1, 1) 



has lower energy than the "contrastive" state (1,0) + ^ (0,1). Fig. 1.1 
provides an illustration of the space of all pair- wise energies. 

Submodularity is an important property of y?y since exact optimization 
of binary submodular functions can be done in polynomial time regardless 



of the graph structure 8 (Greig et al. (1989)). One such optimization al- 



gorithm identifies a 1 : 1 mapping between binary assignments x and cuts 
on a specially constructed graph. Careful choice of weights for the edges on 
this special graph gives rise to a 1 : 1 mapping between the weight of a cut 
and F (x) of the appropriate binary assignment x. This construction and 
choice of weights is illustrated in Fig. |1.2| The appropriate correspondences 
between assignments and graph-cuts, and between cut weight and energy 
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Fig. 1.1: types of binary F(x): submodular vs. non-submodular. Green color 
indicates energies for which exact minimization can be done in polynomial 
time. 



are illustrated in Fig. L3, Once this 1 : 1 mapping is established, optimizing 
F (x) is simply finding a minimum cut of the constructed graph. This can be 
done in polynomial time provided that all the weights of the edges are non- 
negative (Cormen et al. ( 2001[ )). Examining the details of this construction 
reveals that edge weights are non-negative iff the function F (x) is submod- 
ular. Details of this construction can be found in e.g., Greig et al. (1989); 



Boykov et at] fl200ip and in more detail in Kolmogorov and Zabih rt2004h. 

However, when F (x) is not submodular (i.e., there exists at least one 
pair- wise term (fij for which the submodularit y inequality ([T ) does not hold) 
the optimization of F (x) becomes NP-hard (Rother et al. (2007)). In that 
case exact minimum cannot be guaranteed in polynomial time and an approx- 
imation is sought. One notable approach for approximating non-submodular 
binary optimization problems is by an extension of the min-cut approach. 
This method, called QPBO (Quadratic Pseudo-Boolean Optimization) was 
proposed by Hammer et al.\ Q1984D and later extended by |Rother et al.\ ( |2007 ). 
The main idea behind this approximation scheme is that when the energy 
function F (x) is non-submodular, the derived graph has edges with negative 
weights. Therefore, they propose to construct a redundant graph in which 
each variable is represented by two nodes (rather than only one as in the 
original construction). One node represents the case where the variable is 
assigned and the other node represents the case of assigning 1 to the vari- 
able. This redundant representation eliminates the need for negative edge 
weights and thus a min-cut of the new graph can be computed in polyno- 
mial time. Looking at the resulting min-cut we can discern two cases for 
each variable: The first, in which the cut separates the two complementary 
nodes representing this variable. In this case, the cut clearly defines an opti- 
mal state for the variable (either or 1). However, there is a second case in 
which both complementary nodes fall in the same side of the cut. In this case, 
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F (x) = & (x^ + tpj (xj) + (fij (x h Xj) 



9+1 
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Binary energy over two variables i and j parameterized 
by 8 parameters a — h. 

Fig. 1.2: Graph construction: For a binary energy F (x) a weighted graph is 
constructed in the following way: Each variable X{ has a corresponding 
node i. In addition, two special nodes (denoted in this figure as and 
are added. Two edges connect each i th node to the special nodes 
|~0] and \Y\. In addition, an edge (i,j) is added for each interacting 
pair of variables (for which (fij exists). The weights of the edges are 
defined according to the parameters of the energy as illustrated in the 
figure. Note that the weight of the (z, j) edge, g + f — e — h, is exactly 
(fij(0, 1) + ^r/(l, 0) — (fiij(0, 0) — (fij(l, 1), this weight is non-negative iff 
(fij is submodular. 
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Fig. 1.3: Binary energies and graph-cuts: All four possible cuts of the graph 
constructed in Fig. A cut (dashed red) separates the two special 

nodes, _0_ from \Y\. Cut edges (edges pointing from the _0_ side to the 
|Tj side) are marked with dash lines. First two rows show the 1 : 1 
mapping between a cut and an assignment to x. Last two rows show the 
1 : 1 mapping between the weight of the cut (using the weights defined in 



Fig. 1.2) and the energy F (x) of the appropriate assignment. 
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^^^^^^^^ structure 


Tree 


Cyclic 


pair-wise ^^^^^^^^ 




submodular 


Easy: mincut, BP 


Easy: mincut 


non-submodular 


Easy: QPBO, BP 


Hard 



Tab. 1.1: Hardness of binary optimization: the computational "hardness" of 
a discrete minimization as a function of the underlying graph structure 
(£), and the class of pair-wise interactions (<fij). 



we are unable to determine what is the proper assignment for this variable 
and the variable remains unlabeled. 

Therefore, we can conclude that QPBO extends the min-cut approach to 
handle non-submodular binary energies. Recovery of the global minimum is 
no longer guaranteed, but the algorithm may recover a partial labeling that 
is guaranteed to be part of some globally optimal solution. However, in the 
extreme case it may happen that QPBO is unable to label any variable. 

Table summarizes the different types of pair- wise binary energy func- 
tions and the difficulty they entail on their optimization. 



1.4 Multilabel problems 
A multilabel discrete problem is the optimization of a discrete function F (x) 



of the form (1.1) defined over a finite discrete vector x £ {1, . . . , L} n . As was 
shown in the previous section, properties of the pair-wise terms (pij of F (x) 
have a crucial effect on the computational complexity of the optimization 
problem. For the binary case the only distinctive property was the sub- 
modularity of F (x). However, when discrete variables over more than two 
states are considered, there are more subtle types of pair- wise interactions 
that affect the ability to optimize, or at least efficiently approximate it. This 
section describes these various types of (pij and their effect on the discrete 
optimization task. 



The following definition of multilabel submodularity is given in |Schlesingei 
and Flach| ( [2006l ). 



Definition 2 (Multilabel submodular). Assume the labels (a, /3, . . .) are fully 
ordered. Then F (x) is multilabel submodular iffVi,j G £ and Ma < 
7, P < S the following inequality holds 



(fij (a, ft) + (fij (7, 5) < (fij (a, 5) + <p y (7, p) 



(1.11) 
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A slightly simpler and equivalent condition for submodularity uses the 
following condition: 

<Pij (a, p) + ifij (a + 1, $ + 1) < ifij (a, + 1) + ^ (a + 1, /3) (1.12) 

F (x) is multilabel submodular iff fll.l2| ) holds for all labels a ) f3 and for all 

The notion of submodularity is strongly related to Monge matrices 



(Cechlarova and Szabo ( 1990[ )): A matrix V is a Monge matrix iff V a ^ + 



V^+i^+i < V^+i^+V^^+i, Va, /5. Monge matrices were defined by the French 



mathematician G. Monge ([1781 ), and they play a major role in optimal trans- 



portation problems and other discrete optimization tasks (see, e.g., Burkard 



(2007)). Consider a matrix V G Mr whose entries are V a p=(Pij (a, /3), then 
(fij is multilabel submodular iff V is a Monge matrix. 

A sub class of multilabel submodular functions are functions where (p^ 
are convex on the set of labels. Convexity on a discrete set is defined in 
Ishikawa ( 2003[ ), as follows: 



Definition 3 (Convexity on a discrete set). A real valued function g(x) is 
convex on a set A iff 

g {tx + (1 - t)y) < tg (x) + (1 - t)g (y) (1.13) 

for all x,y G A and t G [0, 1] s.t. tx + (1 — f)y G A 

When the set of labels is fully ordered and if Vi, j tp^ (x^ xj) = g^ {x{ — xf) 
and all g^ are convex, then F(x) is convex. For example (fij(xi,Xj) = 
\x{ — Xj\ p . Note that a truncated £ p norm is no longer convex. 



Claim 1.1. Convex is a special case of submodular Schlesinger and Flach 



(2006). 



Proof. Let (p(a,/3) = g (a — (3) for some convex g (x). Then for all ol,($\ 
I(a-/3-l) + i(a-/3 + l)) < ±g (a - - 1) + X -g (a - + 1) 

g(a-0) < ^g(a-0-l) + ^g(a-0 + l) 

g(a- f3)+g(a- f3) < g (a - - 1) + g (a - + 1) 
g(a-0)+g(a + l-(0 + l)) < g (a - (0 + 1)) + g (a + 1 - 0) 
<p(a,0) + (p(a + l,0 + l) < cp(a + 1,0) + ip{a,0 + 1) 



From property (1.12) it follows that a convex ip is also submodular. □ 
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To make these definitions more concrete, we can consider a few examples. 



Popular pair- wise terms of the form f^ (x i3 Xj) 



Xj\ (also known as 



are both convex and therefore multilabel 
[or truncated) version of these t p terms: 



and the £ 2 ' <Pij x j) = ( x i ~ x jY 
submodular. However, the robust 
ifij fa, Xj) = min{|a^ — Xj\ p , r}, is no longer multilabel submodular. 

An important result regarding the minimization of multilabel submodular 
functions is presented in Schlesinger and Flach (2006). A reduction is made 



from submodular minimization to st-mincut on a specially constructed graph. 
It is shown that when the original energy is multilabel submodular all weights 
in the resulting graph are non-negative and hence a global minimum can be 
found in polynomial time. This construction generalizes the construction of 



Ishikawa (2003) that is specific to convex pair- wise functions. 



However, submodularity of F (x) is a very restrictive property. The well 
known Potts term, and many other pair-wise interactions are not submodu- 
lar. Still, there are other important properties for non-submodular functions 
F(x). |Boykov et al\ (2001) derived important approximations that rely on 



other properties of F (x) . These properties were further relaxed by Kol 



mogorov and Zabih (2004): 



Definition 4 (Relaxed metric). 

Vi, j G £ and Va, /?, 7 



A function F (x) is a relaxed metric iff 



ifij (a, a) + ifij (/3, 7) < fij (/3, a) + f { j (a, 7) 



(1.14) 



The condition of Def . [4] resembles the triangle inequality of metric func- 
tions in the case fij{a ) a) = 0. The Potts model and the robust (truncated) 
£1 interaction, mentioned earlier in this section are examples of relaxed-metric 
pair-wise interaction. Note that this property of relaxed metric is different 
than the convexity of Def. |3| Another property, less restrictive than relaxed 
metric is: 

Definition 5 (Relaxed semi-metric). A function F(x) is a relaxed semi- 
metric iff Mi,] G £ andMa^fi 

fij (a, a) + if^ (/3, p) < f^ (/?, a) + f {j (a, (3) (1.15) 



Examples of relaxed semi-metric functions: 1%, truncated £ 2 - Clearly, any 
relaxed metric function is also a relaxed semi-metric. At this point it may be 
useful to get some intuition about the meaning of the semi-metric property: 
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convex submodular relaxed metric relaxed semi-metric 

t\ t\ truncated £\ £ 2 

£ 2 £2 £\ truncated £ 2 

Potts 

Tab. 1.2: examples of different types of pair-wise functions tpij 



According to Def. [5] a function F (x) is semi-metric if the cost of assigning 
neighboring variables i and j to the same label (either a or 0) is never greater 
than the cost of assigning them to different labels. This property implies that 
F (x) encourages smoothness of the solution x. 



Figure |1.4| shows the relation between the different types of functions 
(fij. The most restrictive type is the convex (Def. [3]) which is a subset of 
submodular (Def. [2]). Regarding relaxed metric and submodular: there are 
submodular functions that are not relaxed metric (e.g., £ 2 ), and there are re- 
laxed metric that are not submodular (e.g., Potts). Table [L2| shows examples 



of popular pair-wise functions and their properties. 

Claim 1.2. In general, multilabel submodular (Def.^ is not metric (Def.^). 

Proof. Let cpij be multilabel submodular. Consider three labels a < 7 < S. 
Choose /3 — 7. We now have a < 7 and /3 < 5. Submodularity of cpij (Def. [2^ 
yields: 

(fij (a, 0) + (fij (7, S) < ip i:j (a, S) + (fij (7, 0) 
<Pij ( a > 7) + <Pij (7, 5) < <Pij 0, S) + (fij (7, 7) 



This inequality is the opposite of the inequality of Def. ^ (semi-metric). In 
general, Equality does not hold and therefore most submodular (p^ are not 
metric. However, for ^1, i.e., (p^ {x^Xj) = \xi — Xj\ equality holds and thus 
£\ is a special case of (p^ that is both submodular and metric. □ 



Unfortunately, the promising results of Schlesinger and Flach (2006) re- 



garding the global minimization of submodular functions does not hold for 
more general functions. When F (x) is no longer submodular one can no 
longer hope to achieve global optimality in polynomial time. However, for 



relaxed metric and relaxed semi-metric functions Boykov et al. (2001) showed 



large move making approximate algorithms that performs quite well in prac- 
tice (see e.g., Szeliski et al. (2008)). Large move making algorithms iteratively 
seek to improve the energy of a current solution by updating large number 
of variables at once. Each such large step is carried out by solving a simple 
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Fig. 1.4: Different types of multilabel F(x): The hierarchy and relations be- 
tween different types of multilabel energies. Green indicates the existence 
of global minimization algorithms. For energies in red there are good 
approximation algorithms. 



binary submodular minimization via st-mincut. For relaxed metric functions 
the large move is a-expansion. At each iteration a binary problem is solved: 
for each variable it can either retain its current label (0) or switch label to a 
(1). The relaxed metric property of F (x) ensures that the resulting binary 
problem is submodular and thus can be solved globally in polynomial time. 
The a-expansion algorithm iterates over all labels until it converges. Con- 
vergence after finite number of iterations is guaranteed, and in certain cases 
some theoretical bounds can be proven on the quality of the approximation 
(see Boykov et al. (2001) for more details). 

For relaxed semi-metric functions a slightly different large move is devised. 
For each pair of labels a and /3 the large move is called a/3-swap. At each 
iteration a binary problem is solved for all variables currently labeled either 
a or /3: for each variable it can pick either a (0) or /3 (1). The relaxed 
semi-metric property of F (x) ensures that the resulting binary problem is 
submodular and thus can be solved globally in polynomial time. The a/3- 
swap algorithm iterates over all pairs of labels until it converges. Convergence 
after finite number of iterations is guaranteed, however, theoretical bounds on 
the approximation no longer exists (see Boykov et al. (2001) for more details). 
Table |1.3| shows the resulting types of pair- wise energies and the current 
results on their minimization. This thesis focuses on the hard optimization 
of the contrast-enhancing functionals defined on cyclic graphs. Part \U\ shows 
how introducing energies that contain contrast-enhancing terms gives rise to 
new applications. While Part III deals with the methods of approximating 
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^^^^^^^^ structure 


Tree 


Cyclic 


pair-wise ^^^^^^^^ 




submodular 


Easy: mincut, BP 


Easy: mincut 


semi-metric 


Easy: BP 


Good Approximations 


contrast-enhancing 


Easy: BP 


Hard 



Tab. 1.3: Hardness of optimization (multilabel): the computational "hard- 
ness" of discrete optimization as a function of the underlying graph 
structure and the class of pair-wise interactions. 



these challenging contrast-enhancing energies. 



1.5 Relation to Linear Programming (LP) 
This section establishes a connection between the pair- wise energy minimiza- 



tion problem (1.10) and the field of convex optimization, in particular to Lin- 



ear Programming (LP). For the following discussion it is useful to introduce 
some new notations and definitions. The first useful representation is the 
overcomplete representation of the solution vector x. This representation is 



defined as follows Wainwright et al. (2005); Wainwright and Jordan (2008): 



Definition 6 (Overcomplete representation). A discrete solution x can be 
represented by an extended binary vector 0(x), s.t. 



( X )z,c = 8 ( X i == a ) 

( x )u>/3 = 5 ( x i ==<*)• 6 (Xj == p) 



(1.16) 
(1.17) 



where S(-) is the Kronecker delta function. 



The overcomplete representation projects a discrete vector x of dimension 
n into a d-dimensional binary vector (j) (x) . The index set of vector cj) (x) is 
defined as I = {i, a} U {ij, a/3}, with d— \X\. 

With the overcomplete representation in mind, it is useful to parameterize 
the discrete function F (x) of Eq. (1.1) using a parameter vector 9: 



0i,a = Pi (a) 



(1.18a) 



0ij,aP = <Pij {OL,/3) 



(1.18b) 
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Combining Def. [6| with the parametrization of (1.18), the functional of 



Eq. (1.1) becomes 



= <M(x)> 



(1.19) 



The overcomplete representation (j) (x) is defined over a discrete set of 
points. However, it is useful to consider a relaxation of this set into a convex 
continuous domains. 

The tightest relaxation of the discrete set {(/) (x) |x £ {1, . . . , I} 71 } is the 
marginal polytop M (£): 

Definition 7 (Marginal polytop). The marginal polytop of F (x) is the con- 
vex combination of the vertices (j) (x) for the l n discrete solutions x. This set 
is formally defined as: 



fi = ^^p (x) 4> (x) , for some distribution p(-) > (1-20) 



This marginal polytop, M (£), is defined by finite, yet exponentially large, 
number of half-spaces. Therefore, it is convenient to define a relaxed version 
of the marginal polytop: 

Definition 8 (Local polytop). The local polytop of F (x) is the convex set: 



L (S) = { r e 



Ea n,a = 1 Vi 



(1.21) 



Unlike the marginal polytop, L (£ ) is defined using only polynomial num- 
ber of half-spaces, and therefore it admits polynomial time optimization 



schemes. In fact, L (£) is the first order approximation of M (£) ( Wainwright 
and Jordan| ( [2008] )) 



Note that the geometry of M {£) and L (£ ) are affected by the number 
of variables n, the number of states / and by the underlying graph structure 
£ defining the interacting pairs of variables. These polytops are not affected 
by the parameters of tpi and cpij. 

By standard properties of LP, the optimal value is attained at an extreme 
point of the constraint set (a vertex of the constraints polytop). The marginal 
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(a) 




Fig. 1.5: The Marginal and Local polytops: An illustration of the marginal, 
M (£), and local, L {£), polytops. (a)JL {£) is an outer bound on the exact 
constraints o/M (£). The vertex set o/L (£) includes all integral vertices 
o/M(£) (marked in red) . It also includes fractional vertices (marked in 
black), which are not vertices o/M(£). (b), (c) Solving an LP with cost 
vector 9 entails translating a hyperplane with normal 9 until it is tangent 
to the constraint set. In (b) the point of tangency of cost vector 9\ occurs 
at an integral vertex. In (c) the point of tangency of cost vector 92 occurs 
at a fractional vertex, outside M(£). In this case, there is an integrality 
gap. (figure taken from Wainwright et al. (200ty). 



polytop, M(£), is a convex set defined by all the possible solutions 0(x). 
Hence, a vertex of M {£) corresponds to a vector (j) (x) for some discrete 
solution x. We refer to these vertices as integral vertices corresponding to 
integral solutions of the LP. On the other hand, the local polytop, L(£), 
may contain more vertices that do not correspond to any discrete solution, 
x. We refer to these vertices as fractional solutions. Fig. |1.5| provides an 
illustration of the marginal and local polytops, the relation between them, 
and their impact on the optimal solution of LP. The figure also distinguishes 
between the integral and fractional vertices of the polytops. (Wainwright 



et al., 2005, Example 3) describes in detail a case where fractional solution 
is optimal. It is important to note that the parameter vector 9 for which 
the fractional solution is optimal, in their example, is such that encourages 
contrast between variables (i.e., Xi ^ xj for neighboring i and j). 

The relation between discrete energy minimization (problem (1.10)) and 
convex LP presented in this section lies in the foundation of popular optimiza- 
tion algorithms such as tree-reweighted BP (see Sec. 1.6.3). This relation is 
also important to illustrate the challenging task of optimizing (1.10): When 9 
represents an energy function that is "smoothness-encouraging" its optimal 
value usually corresponds to an integral vertex of L {£ ) and thus its opti- 
mization can be done exactly via LP over the relaxed constraint set L(£). 
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However, when the energy has contrast-enhancing terms its optimal solution 
w.r.t L (£) is fractional - the global integral solution cannot be attained us- 
ing the relaxed constraints of L (£). Therefore, the optimization of energies 
that have contrast-enhancing terms, is a very challenging task. This thesis 
focuses on these energies and proposes methods to cope with this inherent 
difficulty. 



1.6 Discrete Optimization Algorithms 

In the previous sections we outlined some key properties of the energy func- 
tion F (x) and their effect on the minimization process. We also demon- 
strated how specific optimization algorithms take advantage of these prop- 
erties. In this section we present several prominent optimization algorithms 
that we refer to later on in this thesis. These selected representative ap- 
proaches sketches the main directions at which current discrete optimization 
research is mainly focused. 



1.6.1 Iterated Conditional Modes (ICM) 



ICM is a very simple and basic iterative optimization algorithm proposed 
by |Besag| ( p^86| . It is an approximate method, acting locally on the vari- 
ables, suitable for multilabel functions with arbitrary underlying graph and 
arbitrary (fij. At each iteration ICM visits all the variables sequentially, and 
choose for each variable the best state (with the lower energy) given the cur- 
rent states of all other variables. This process can be viewed as a greedy 
coordinate descend algorithm and it bears some analogy to Gauss-Seidel re- 



laxations of the continuous domain (Varga (1962)). 

ICM is a local update process and therefore is prone to getting stuck very 
fast in local minimum. It is also extremely sensitive to initialization (see e.g., 
Szeliski et al] ( [20081 )). 

When taking a probabilistic point of view, and considering the energy 
function as a Gibbs energy, that is representing some measure over all possible 
solutions: 

Pr (x)ocexp(- ±F (x)) (1.22) 

ICM may be viewed as a Gibbs sampler at the temperature limit T 
0. Therefore, its performance is expected to be inferior to more sophisti- 
cated sampling methods such as, e.g., simulated annealing (Kirkpatrick et 

Z]( [T983l )). 
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1.6.2 Belief Propagation (BP) 

Belief-propagation is an optimization algorithm based on local updates. How- 
ever, in contrast to the hard assignment ICM performs at each update, BP 
maintains "soft" beliefs for each variable and passes messages between neigh- 
boring variables according to their current belief. A message from variable i 
to its neighbor j, m^j, is a vector of length L. That is, the message vector 
encodes how i "feels" about assigning state a to j. 



^3 



(a) = mm < ^ (/?) + ^ m k ^i + <p y (/?, a) 

ki^S ,k^j 



(1.23) 



The belief of each variable i is also a vector of length L encoding the 
tendency of i to be assigned to state a: 



bi (a) = (fi (a) + ^ m k^i W 

kieS 



(1.24) 



BP iteratively passes messages and updates the local belief for each vari- 
able. After its final iteration, each variable is assigned the label with the 
lowest energy, i.e., Xi = argmin a bi(a). 

Originally, BP was used as an inference algorithm in tree-structured 
graphical models (Pearl (1988); Koller and Friedman ( 2009| )). Messages were 
initialized to zero. Then messages were passed from leafs to root and back 
to the leafs. This forward-backward message passing converges to the global 
optimum when S is a tree, regardless of the type of cpij that can be arbitrary. 

When the underlying graph £ has cycles, BP is no longer guaranteed to 
converge. It was proposed to run BP on cyclic graphs, a variant called loopy- 
BP. In the loopy case, however, it is not clear how to schedule the messages 
and how to determine the number of iterations to perform. Even if the loopy 
BP converges to some fixed point, it is usually a local optimum with no 



guarantees on global optimality (Koller and Friedman (2009); Wainwright 
and Jordan| pt^ ) 



1.6.3 Tree-reweighted Belief Propagation (TRW) 



et al. 


( 


2005 


); 


Kolmogorov 


( 


2006 


); 


Werner 


( 


2007 


); 


Wainwright and Jordan 


(2008 


); 


Komodakis et al 


( 


2011). 


These works proposed a new interpretation 



to the basic message passing operation that BP conducts. They relaxed 
the discrete minimization of F (x) to form a continuous linear programming 



(LP), in the same manner that was presented in Sec. 1.5, Then they related 
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message passing to the optimization of the resulting LP. a It was shown that 
the relaxation of F (x) forms an LP with very specific structure. This special 
structure can, in turn, be exploited to devise a specially tailored optimization 
scheme that uses message passing as a basic operation. 

The tree-reweighted BP approach establishes a relation between discrete 
optimization and continuous convex optimization of LP. This relation brings 
forward interesting results and properties from the continuous optimization 
domain to the discrete one. For instance, it allows to use Lagrangian mul- 
tipliers and formulate a Lagrangian dual to the original problem. The dual 
representation provides a lower bound to the sought optimal solution. If a 
solution x* is found with an energy F (x*) equals to the lower bound, then 
a certificate is provided that this x* is a global minimum. 



It was shown (e.g., Szeliski et al. (2008)) that in practice in many com- 
puter vision application TRW was able to recover globally optimal solutions. 
These results dealt mainly with relaxed metric energies (see Sec. 1.4). How- 
ever, there is no general guarantee on TRW and there are cases involving 
challenging energies, beyond relaxed metric, for which it was shown that an 
integrality gap exists and TRW can no longer provide a tight approximation 
(e.g., Kolmogorov ( [20061 ; [Bagon and Gahm] ( [20lT] )). 



1.6.4 Large Move Methods 
As opposed to local methods such as ICM, BP and TRW, there is the ap- 



proach of Boykov et al. (2001) that proposes discrete methods based on 



combinatorial principles. The basic observation that lies at the heart of the 
large-move algorithms is that instead of treating the variables locally one at 
a time, one may affect the labeling of many variables at once by performing 
large moves. These large moves are formulated as a binary step, and the 
difference between the different "flavors" of the large-move algorithms is the 
formulation of these binary steps. What makes these large move effective 
and efficient is the fact that binary submodular sub-problems can be solved 
globally and efficiently. 

The two basic large move algorithms, a-expand and a/3-swap, were al- 
ready described in Sec. |1.4| in the context of relaxed metric and relaxed 
semi metric energies. Recently, another large move making algorithm called 
fusion-moves was proposed (Lempitsky et al. (2007, 2010| )) At each iteration 
of the fusion algorithm a discrete solution is proposed. The proposed solution 
is fused into the current solution via a binary optimization: each variable can 
retain its current label (0), or switch to the respective label from the pro- 
posed solution (1). However, unlike the swap and expand algorithms, the 
resulting binary optimization of the fusion step is no longer guaranteed to be 
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submodular and highly depends on the types of proposed solutions. There- 
fore, it is often the case that QPBO (Kolmogorov and Rother ( 2007| )), which 
is a non-submodular binary approximation algorithm, is used to perform the 
binary steps of the fusion algorithm. 



To summarize this brief outline of existing approximation algorithms one 
may notice that a lot of effort is put in recent years in developing and im- 
proving approximate optimization algorithms. Research is put into both 
providing better practical results and into exploring the theoretic aspects of 
the problem. Approximation methods are derived and inspired by both the 
continuous optimization domain (e.g., TRW) and the discrete domain (e.g., 
graph-cuts). However, these results mainly focus on the minimization of func- 
tions F (x) that have some structure to them: either relaxed metric or relaxed 
semi-metric (see, for example, the survey of |Szeliski etd\ ([2008)). For these 
smoothness-encouraging functions current algorithms succeed in providing 
good approximations in practice, despite their theoretical NP-hardness. In 
contrast, when it comes to arbitrary, contrast- enhancing functions, little is 
known in terms of approximation and no method currently exists (to the 
best of our knowledge) that provides satisfying approximations. This thesis 
focuses on the optimization of arbitrary, contrast-enhancing functions. 



2. OUTLINE OF THIS THESIS 



Discrete energy minimization is a ubiquitous task in computer vision and in 
other scientific domains. However, in the previous chapter we saw that the 
optimization of such discrete energies is known to be NP-hard in most cases 



(Boykov et al. (2001)). Despite this theoretical hardness, for many "smooth- 
ness encouraging" energies (relaxed semi-metric), approximate optimization 
algorithms exist and provide remarkable approximations in practice. 

However, as tasks become more sophisticated, the models grow more 
complex: From tree structured to cyclic graphs and grids, and from simple 
smoothing priors to complex arbitrary pair- wise interactions. As the energies 
become less constrained and structured one gains more expressive power for 
the objective function at the cost of a significantly more challenging opti- 
mization task. 

In this work I would like to step outside this "comfort-zone" of the 
smoothness-encouraging energies and explore more challenging discrete en- 
ergies. This step gives rise to two important questions: 

1. Why bother? Why should one consider energy functions beyond semi- 
metric? What can be gained (in term of expressive power) considering 
the significant hardness of the entailed optimization task? 

2. In case we decide to embark on this challenging task of approximating 
arbitrary discrete energy, how can we tackle this problem? Can we 
propose new approaches and directions for the difficult approximation 
tasks of discrete energies, beyond semi-metric? 

These two research questions provide the road map of this thesis. Con- 
sequently, this thesis revolves around two major axes: applications and ap- 
proximations. 



Applications 

The first axis of this thesis involves exploring new applications that require 
arbitrary, contrast-enhancing energies beyond semi-metrics. These examples 
demonstrate how the additional expressive power of arbitrary energies is 
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crucial to derive new applications. We show how utilizing arbitrary energies 
gives rise to interesting and desirable behaviors for different applications. We 
present these new applications in Part [IT] of this thesis. 

Chapter [3] shows an image sketching application that provides a binary 
sketch from a small collection of images of similar objects. In this applica- 
tion the binary sketch is described via the interactions between neighboring 
pixels in the corresponding images. Neighboring sketch bits corresponding 
to similar image pixels are encourage to have similar value (i.e., submodu- 
lar, smoothness-enhancing term). In contrast, neighboring sketch bits cor- 
responding to dissimilar image pixels are encourage to have different value 
(i.e., non-submodular, contrast-enhancing term). The binary sketch is then 
the output of the resulting non-submodular energy minimization. 

The sketching application may be thought of as a special case of binary 
image segmentation. Considering contrast-enhancing objective function for 
the task of unsupervised segmentation or clustering may introduce a solution 
not only to the clustering problem, but also may help in determining the un- 
derlying number of clusters. This clustering objective function is commonly 
known as "Correlation Clustering" (Bansal et al ( 2004| )). Chapter [3] explores 
the Correlation Clustering functional and its underlying "model-selection" 
capability. 

Image segmentation and clustering are not the only examples for contrast- 
enhancing energies. Chapter [5] describes a 3D surface reconstruction from 
multiple images under different known lighting. The reconstruction takes 
into account the changes in appearance of the surface due to the change in 
lighting directions. These changes amounts to an implicit partial differen- 
tial equation (PDE) that describes the unknown surface. In this work we 
propose to pose the solution of the resulting PDE as a discrete optimization 
task. Incorporating integrability prior on the unknown surface, the resulting 
discrete energy has contrast-enhancing terms. 

Modeling and prior knowledge are not the only sources for contrast- 
enhancing terms in energies. In many cases, the exact parameters of an 
energy are learned from training data. Chapter [6] presents Decision Tree 
Fields (DTF): an example of such a learning scheme. DTF learns, in a prin- 
cipled manner, a pair- wise energy from labeled training examples. Since the 
learning process is not constraint to smoothness-encouraging energies, it is 
often the case that the resulting energy has contrast-enhancing terms. In 
its training phase DTF seeks parameters that (approximately) maximize the 
likelihood of the data. Therefore, the resulting contrast-enhancing terms are 
better suited to describe the underlying "behavior" of the data. Restricting 
the model to smoothness-encouraging terms only would prohibit DTF from 
accurately predicting results at test time. 
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Approximate Optimization 

The enhancement in descriptive power gained by considering arbitrary ener- 
gies comes with a price tag: we no longer have good approximation algorithms 
at hand. Therefore, the second axis of this thesis explores possible directions 



for approximating the resulting challenging arbitrary energies. In part [TTT 
of this work I propose practical methods and approaches to approximate 
the resulting NP-hard optimization problems. In particular, in Chapter [7j I 
propose a discrete optimization approach to the aforementioned correlation 
clustering optimization. This approach scales gracefully with the number 



of variables, better than existing approaches (e.g., |Vitaladevuni and Basri 
( [20101 ). 



Chapter [8] concludes this part with a more general perspective on discrete 
optimization. This new perspective is inspired by multiscale approaches and 
suggests to cope with the NP-hardness of discrete optimization using the 
multiscale landscape of the energy function. Defining and observing this 
multiscale landscape of the energy, I propose methods to explore and exploit 
it to derive a coarse-to-fine optimization framework. This new perspective 
gives rise to a unified multiscale framework for discrete optimization. Our 
proposed multiscale approach is applicable to a diversity of discrete ener- 
gies, both smoothness-encouraging as well as arbitrary, contrast-enhancing 
functions. 
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This part concentrate on the first axis of this thesis. This direction 
explores new applications which require arbitrary energies. We start with 
an unsupervised clustering objective function, Correlation Clustering (CC). 
Chapters [3] and [4] show several applications all revolving around the cor- 
relation clustering energy. This energy is hard to optimize: It has both 
smoothness-encouraging as well as contrast ive pair- wise terms; it has no data 
term to guide the optimization process, and the number of discrete labels is 
not known a-priori. We analyze an interesting property of the correlation 
clustering functional: its ability to recover the underlying number of clus- 
ters. This interesting property is due mainly to the usage of terms that 
enhance contrast, rather than smoothness, in the solution x. 

Another application that requires arbitrary energy is 3D reconstruction 
of surfaces. We show, in chapter [5| how under certain conditions 3D recon- 
struction may be posed as a solution to a partial differential equation (PDE). 
Solving this PDE to recover the 3D surface can be done through discretiza- 
tion of the solution space. The discrete version of the PDE yields pair-wise 
terms that are beyond semi-metric. 

Finally, the parameters of the energy function defining the terms cpi and 
(fij may not be fixed a-priori. It may happen that one would like to learn the 
energy function from training data for various applications. Chapter [6] shows 
an example of such an energy learning framework. The resulting learned 
energy is no longer guaranteed to be "well-behaved". In fact, experiments 
show that when the learning procedure is not constrained it is often the case 
that the resulting energy is arbitrary and does not yield any known structure. 



3. SKETCHING THE COMMONH 



Given very few images containing a common object of interest under severe 
variations in appearance, we detect the common object and provide a com- 
pact visual representation of that object, depicted by a binary sketch. Our 
algorithm is composed of two stages: (i) Detect a mutually common (yet 
non-trivial) ensemble of 'self-similarity descriptors' shared by all the input 
images, (ii) Having found such a mutually common ensemble, 'invert' it to 
generate a compact sketch which best represents this ensemble. This pro- 
vides a simple and compact visual representation of the common object, while 
eliminating the background clutter of the query images. It can be obtained 
from very few query images. Such clean sketches may be useful for detection, 
retrieval, recognition, co-segmentation, and for artistic graphical purposes. 

The 'inversion' process that generates the sketch is formulated as a dis- 
crete optimization problem of a binary, non-submodular energy function. 



3.1 Introduction 

Given very few images (e.g., 3-5) containing a common object of interest, 
possibly under severe appearance changes, we detect the common object and 
provide a simple and compact visual representation of that object, depicted 



by a binary sketch (see Fig. 3.1). The input images may contain additional 
distracting objects and clutter, the object of interest is at unknown image 
locations, and its appearance may significantly vary across the images (dif- 
ferent colors, different textures, and small non-rigid deformations). We do 
assume, however, that the different instances of the object share a very rough 
common geometric shape, of roughly the same scale (±20%) and orientation 
(±15°). Our output sketch captures this rough common shape. 

The need to extract the common of very few images occurs in various 
application areas, including: (i) object detection in large digital libraries. 
For example, a user may provide very few (e.g., 3) example images containing 



1 This is joint work with Or Brostovsky, Meirav Galun and Michal Irani. It was pub- 
lished in the 23 rd International Conference on Computer Vision and Pattern Recognition 
(CVPR), |20lO 
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Fig. 3.1: Detecting and sketching the common: (a) The 4 input images pro- 
vided to the algorithm, (b) The least trivial common part (the heart) is 
detected and sketched by the algorithm. 



an object of interest with varying appearances, and wants to retrieve new 
images containing this object from a database, or from the web. (ii) Co- 
segmentation of a few images, (iii) Artistic graphical uses. 

Our method is based on densely computed Local Self- Similarity Descrip- 
tors Shechtman and Irani (2007). Our algorithm is composed of two main 
steps: (i) Identify the common object by detecting a similar (yet "non- 
trivial") ensemble of self- similarity descriptors^ that is shared by all the input 
images. Corresponding descriptors of the common object across the different 
images should be similar in their descriptor values, as well as in their relative 
positions within the ensemble, (ii) Having found such a mutually common 
ensemble of descriptors, our method "inverts" it to generate a compact binary 
sketch which best represents this ensemble. 

It was shown in Shechtman and Irani (2007) that given a single query 
image of an object of interest (with very little background clutter), it is 
possible to detect other instances of that object in other images by densely 
computing and matching their local self-similarity descriptors. The query 
image can be a real or synthetic image, or even a hand-drawn sketch of the 
object. 

In this paper we extend the method of |Shechtman and Irani (2007) to 
handle multiple query images. Moreover, in our case those images are not 
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centered around the object of interest (its position is unknown), and may 
contain also other objects and significant background clutter. Our goal is to 
detect the "least trivial" common part in those query images, and generate as 
clean as possible (region-based) sketch of it, while eliminating the background 
clutter of the query images. Such clean sketches can be obtained from very 
few query images, and may be useful for detection, retrieval, recognition, and 
for artistic graphical purposes. Some of these applications are illustrated in 
our experiments. 



Moreover, while Shechtman and Irani (2007) received as an input a clean 



hand-drawn sketch of the object of interest (and used it for detecting other 
instances of that object), we produce a sketch as one of our outputs, thereby 
also solving the "inverse" problem, namely: Given several images of an ob- 
ject, we can generate its sketch using the self-similarity descriptor. 

A closely related research area to the problem we address is that of 'learn- 
ing appearance models' of an object category, an area which has recently 



received growing attention (e.g., 


Chum and Zisserman 


(2007 


); 


Chum et al. 


02OO9D; Ferrari et al. (2009); Winn and Jojic 


(2005); Karlinsky et al. 


(2008); 


Lee and Grauman (2009); 


Nguyen et al. 


(2009); 


Wu et al. (2009); Zhu et al. 


(2008), to name just a few) 


. The goal of these methods is to discover common 



object shapes within collections of images. Some methods assume a single 



2009 



Zhu et al. 



object category (e.g., Chum and Zisserman (2007); Ferrari et al. (2009); Kar 



linsky et al\ ([20081) ; |Winn and Jojic| fl2005[ ); [Nguyen et al.\ Q2009[ ); |Wu etal. 



( 20081)), while others assume multiple object categories 



(e.g., Chum et al. ( 2009| ); [Lee and Grauman ( 2009| )). These methods, which 



rely on weakly supervised learning (WSL) techniques, typically require tens 
of images in order to learn, detect and represent an object category. What 
is unique to the problem we pose and to our method is the ability to depict 
the common object from very few images, despite the large variability in its 
appearance. This is a scenario no WSL method (nor any other method, to 
our best knowledge) is able to address. Such a small number of images (e.g., 
3) does not provide enough 'statistical samples' for WSL methods. While our 
method cannot compete with the performance of WSL methods when many 
(e.g., tens) of example images are provided, it outperforms existing meth- 
ods when only few images with large variability are available. We attribute 
the strength of our method to the use of densely computed region-based in- 
formation (captured by the local self-similarity descriptors), as opposed to 
commonly used sparse and spurious edge-based information (e.g., gradient- 
based features, SIFT descriptors, etc.) Moreover, the sketching step in our 
algorithm provides an additional global constraint. 

Another closely related research area to the problem addressed here is 'co- 



segmentation' (e.g., Rother et al. (2004b); Bagon et al. (2008); Mukherjee et 
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Fig. 3.2: The Local Self Similarity Descriptor: (Figure taken from Shechtman 
|and Irani ( 2007[ ).) The self- similarity descriptor for any given point 
(e.g., the green point in the left image), is computed by measuring the 
similarity of a 5 x 5 patch around the point with the surrounding 60 x 60 
image region. This results in a 'correlation' surface (middle image). The 
correlation surface is quantized into a compact log-polar representation 
of 45 bins (15 angles, 3 radial intervals) to achieve invariance against 
small local affine and non-rigid deformations. The maximum value in 
each bin constitutes the value at the corresponding descriptor entry (right 
most image). 



al. ( 2009[ )). The aim of co-segmentation is to segment out an object common 
to a few images (2 or more), by seeking segments in the different images that 
share common properties (colors, textures, etc.) These common properties 
are not shared by the remaining backgrounds in the different images. While 
co-segmentation methods extract the common object from very few images , 
they usually assume a much higher degree of similarity in appearance between 
the different instances of the object than that assumed here (e.g., they usually 
assume similar color distributions, similar textures, etc.) 

The rest of the paper is organized as follows: Sec. |3.2| formulates the 



problem and gives an overview of our approach. Sec. |3.3| describes the com- 
ponent of our algorithm which detects the 'least trivial' common part in a 



collection of images, whereas Sec. 3.4 describes the sketching component of 



our algorithm. Experimental results are presented in Sec. 3.5 



3.2 Problem Formulation 

Let 7i, be K input images containing a common object under widely 

different appearances. The object may appear in different colors, different 
textures, and under small non-rigid deformations. The backgrounds are ar- 
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bitrary and contain distracting clutter. The images may be of different sizes, 
and the image locations of the common object are unknown. We do assume, 
however, that the different instances of the object share a very rough com- 
mon geometric shape, of roughly the same scale and orientation. Our output 
sketch captures this rough common shape. 

Our approach is thus based on detecting 'common regions' (as opposed 
to 'common edges'), using densely computed Local Self- Similarity Descrip- 
tors Shechtman and Irani (2007). This descriptor (illustrated in Fig. 3.2) 



captures local shape information in the image vicinity where it is com- 
puted, while being invariant to its photometric properties (color, texture, 
etc.) Its log-polar representation makes this descriptor insensitive to small 
affine and non-rigid deformations (up to ±20% in scale, and ±15°). It was 
further shown by Horster et al\ ( 2008| ) that the local self-similarity descriptor 
has a strong descriptive power (outperforming SIFT). The use of local self- 
similarity descriptors allows our method to handle much stronger variations 
in appearance (and in much fewer images) than those handled by previ- 
ous methods. We densely compute the self-similarity descriptors in images 
h, Ik ( a t every 5-th pixel). 'Common' image parts across the images will 
have similar arrangements of self similarity descriptors. 

Let ci, ck denote the unknown locations of the common object in the 
K images. Let I^ k denote a w x h subimage of Ik centered at c^, containing 
the common object (k = 1,...,X) (need not be tight). For short, we will 
denote it by Ik- The sketch we seek is a binary image S of size w x h which 
best captures the rough characteristic shape of the common object shared by 
ii, Ik- More formally, we seek a binary image S whose local self-similarity 
descriptors match as best as possible the local self-similarity descriptors of 
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ii, Ik- The descriptors should match in their descriptor values, as well as 
in their relative positions with respect to the centers {ck}'. 



K 



Score(S\Ii, Ik) = match(S, h) (3.1) 

k=i 

K wh 
k=l i=l 

where df is the z-th self-similarity descriptor computed at image location li 
in the sketch image S, d^ is the self-similarity descriptor computed at the 
same relative position l{ (up to small shifts) in the w x h subimage and 
sim{di ) d 2 ) = — || d\ — d 2 \\ p measures how similar two descriptor vectors are 
(we experimented with L p norms for p = 1,2). Thus, the binary sketch we 
seek is: 

S = argmax{Score(S\Ii, Ik)} s -t- S(l) £ { — 1,1} (3.2) 
where S(l) is the value of S at pixel /. This process is described in detail in 



Sec. |3.4[ and results in a sketch of the type shown in Fig 3.3 



While edge-based detection and/or sketching Lee and Grauman (2009); 



Zhu et al (2008); Ferrari et al (2009) requires many input images, our region- 
based detection and sketching can be recovered from very few images. Edges 
tend to be very spurious, and are very prone to clutter (even sophisticated 



edge detectors like Maire et al ( |2008D - see Fig. |3l|b). Edge-based ap 



proaches thus require a considerable number of images, to allow for the con- 
sistent edge/gradient features of the object to stand out from the inconsistent 
background clutter. In contrast, region-based information is much less sparse 
(area vs. line-contour), less affected by clutter or by misalignments, and is 
not as sensitive to the existence of strong clear boundaries. Much larger 
image offsets are required to push two corresponding regions out of align- 
ment than to misalign two thin edges. Thus, region-based cues require fewer 
images to detect and represent the common object. Indeed, our method 
can provide good sketches from as few as 3 images. In fact, in some cases 
our method produces a meaningful sketch even from a single image, where 
edge-based sketching is impossible to interpret - see example in Fig. |3.4[ 

In the general case, however, the locations Ci, ...,cx of the object within 
the input images ii, are unknown. We seek a binary image S which 

sketches the 'least trivial' object (or image part) that is 'most common' to 
all those images. The 'most common' constraint is obvious: in each image Ik 
there should be a location c& for which match (5, I^ h ) is high (where Ik = I°k 
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Fig. 3.4: Regions vs. Edges: (a) a single input image. (b) The edge map 
generated by the method of \M aire et al. (2008). (c) The binary sketch 
generated by our method when applied to the single input image (using 
all the self- similarity descriptors densely computed in that image). This 
illustrates the concept that region-based information is much richer than 
sparse edge-based information, and therefore appears to be more powerful 
for detection and for sketching. 



is the subimage centered at c&). However, there are many image regions that 
are trivially shared by many natural images. For example, uniform regions 
(of uniform color or uniform texture) occur abundantly in natural images. 
Such regions share similar self-similarity descriptors, even if the underlying 
textures or colors are different (due to the invariance properties of the self- 
similarity descriptor). Similarly, strong vertical or horizontal edges (e.g., at 
boundaries between two different uniformly colored/textured regions) occur 
abundantly in images. We do not wish to identify such trivial (insignificant) 
common regions in the images as the 'common object'. 

Luckily, since such regions have good image matches in lots of locations, 
the statistical significance of their good matches tends to be low (when mea- 
sured by how many standard deviations its peak match values are away 
from its mean match value in the collection of images). In contrast, a non- 
trivial common part (with non-trivial structure) should have at least one good 
match in each input image (could also have a few matches in an image), but 
these matches would be 'statistically significant' (i.e., this part would not be 
found 'at random' in the collection of images). 

Thus, in the general case, we seek a binary sketch S and locations ci, ck 
in images ii, such that: 

(i) S is 'most common', i n the sense that it maximizes Score{S\Il 1 , .., I c ^) = 
J2k=i match(S, I c k k ) of Eq. 

(ii) S is 'least trivial', in the sense that its matches at ci, ck are statis- 
tically significant, i.e., it maximizes Ylk=i StatSignificance (match(S, I^ k )), 
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where the significance of a match of S is measured by how many standard 
deviations it is away from the mean match value of S. 

Our optimization algorithm may iterate between these two constraints: 
(i) Detect th e lo cations {ck}^ =1 of the least trivial common image part in 
{Ik}k=i (Sec. 3.3). (ii) Sketch the common object given those image locations 
(Sec. |3.4 ). The overall process results in a sketch image, which provides 
a simple compact visual representation of the common object of interest in 
a set of query images, while eliminating any distracting background clutter 
found in those images. 



3.3 Detecting the Common 

We wish to detect image locations ci, ck in 7i, Ik, such that correspond- 
ing subimages centered at those locations, i£ fc , share as many self-similarity 
descriptors with each other as possible, yet their matches to each other are 
non-trivial (significant). The final sketch S will then be obtained from those 



subimages (Sec. 3.4). 



Let us first assume that the dimension w x h of the subimages is given. We 
will later relax this assumption. Let I be a w x h image segment (this could 
be the final sketch S, or a subimage extracted from one of the K input images 
in the iterative process). We wish to check if I has a good match in each 
of the input images ii, and also check the statistical significance of 

its matches. We 'correlate' / against all the input images (by measuring the 
similarity of its underlying self-similarity descriptor^]). In each image Ik we 
find the highest match value of /: maxMatch(I , /&). The higher the value, 
the stronger the match. However, not every high match value is statistically 
significant. The statistical significance of maxMatch(I , Ik) is measured by 
how many standard deviations it is away from the mean match value of / in 
the entire collection of images, i.e.,: 

(maxMatch(IJk) ~ avgMatch(I)^J /stdMatch(I) 

where avgMatch(I) is the mean of all match values of / in the collection 
ii, and stdMatch(I) is their standard deviation. We thus define the 



We use the same algorithm employed by Shechtman and Irani (2007) to match en- 



sembles of self-similarity descriptors, which is a modified version of the efficient "ensemble 



matching" algorithm of Boiman and Irani (2007). This algorithm employs a simple prob- 



abilistic "star graph" model to capture the relative geometric relations of a large number 
of local descriptors, up to small non-rigid deformations. 
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'Significance' of a subimage I as: 



1 K 

Significance{I\I\^ Ik) = — StatSignificance (maxMatch(I, Ik 



K 

k=l 

Initially, we have no candidate sketch S. However, we can measure how 
'significantly common' is each w x h subimage of ii, when matched 

against all locations in all the other K— 1 images. We can assign a significance 
score to each pixel p G Ik (k = 1, .., K), according to the 'Significance' of its 
surrounding w x h subimage: Significance[I v k \I\^ ...,I K ). 

We set Q. to be the pixel location with the highest significance score in 
image i.e., c& = argmax p ^i k {Significance{I k \Ii ) ...,/#)}. 

The resulting K points (one per image), Ci, c^, provide the centers for 
K candidates of 'non-trivial' common image parts. We generate a sketch S 



from these image parts (using the algorithm of Sec. 3.4). 

We repeat the above process, this time for / = 5, to detect its best 
matches in ii, Ik- This should lead to improved detection and localization 
of the common object (ci,...,Cx), and accordingly to an improved sketch 
S. This algorithm can be iterated several times. In practice, in all our 
experiments a good sketch S was recovered already in the first iteration. 
An additional iteration was sometimes useful for improving the detection. 



Fig. 3.5 shows two iterations of this process, applied to 4 input images. More 



results of the detection can be seen in Fig. |3.6| 

Handling unknown w x h: In principle, when w x h is unknown, we can 
run the above algorithm "exhaustively" for a variety of w = w min) .., w max 
and h = h min) ..^h^ax^ and choose "the best" w x h (with maximal signifi- 
cance score). In practice, this is implemented more efficiently using "integral 
images" , by integrating the contributions of individual self-similarity descrip- 
tors into varying window sizes w x h. 

Computational Complexity: The detection algorithm is implemented 
coarse-to-fine. The first step of the algorithm described above is quadratic 
in the size of the input images. However, since the number of images is 
typically small (e.g., 3 — 5), and since the quadratic step occurs only in the 
coarsest /smallest resolutions of the images, this results in a computationally 
efficient algorithm. 

3.4 Sketching the Common 

Let I\ , . . . , Ik be the w x h subimages centered around the common ob- 
ject (detected and extracted from the input images using the algorithm of 



Sec. 3.3). The goal of the sketching process is to produce a binary image 
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The input images: 



First iteration: 





Second iteration: 



Fig. 3.5: Iterations of Detection & Sketching: Left: The 4 input images. 

Right: The first iteration of the detection algorithm results in 4 detected 
image regions, of which 3 are correct and one is an outlier (marked by 
red). The resulting sketch produced from these regions is reasonably good 



(due to the robustness of the sketching to outliers - see Sees. 3.4 and 3.5), 
and is used for refining the detection in the input images. This results in 
4 correct detections in the second iteration, and an improved sketch. 
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Fig. 3.6: Detecting and sketching the common: (Left) The input images. 

(Upper-Right) The detected image regions of the common object, includ- 
ing one outlier. (Lower- Right) The resulting sketch. 
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S, which best captures the rough characteristic shape of the object shared 



by 7i, as posed by Eq. (3.2). Namely, find S whose ensemble of self- 

similarity descriptors is as similar as possible to the ensembles of descrip- 
tors extracted from 7i, . . . , If we were to neglect the binary constraint 



S(l) G { — 1,1} in Eq. (3.2), and the requirement for consistency between 
descriptors of an image, then the optimal solution for the collection of self- 
similarity descriptors of S, {di}fj{, could be explicitly computed as: 

di = median/ejrf^} if Li-norm (3.3) 
di = mean/ejdf} if L 2 -norm 

We use the Li-norm to generate these 'combined' descriptors {di}fj[, be- 
cause of the inherent robustness of the median operator to outliers in the 



descriptors (also confirmed by our empirical evaluations in Sec 3.5). Having 
recovered such a collection of descriptors for S, we proceed and solve the 
"inverse" problem - i.e., to generate the image S from which these descrip- 
tors emanated. However, the collection of descriptors {di}fJ[ generated via 
a 'median' or 'average' operations is no longer guaranteed to be a valid col- 
lection of self-similarity descriptors of any real image (binary or not). We 
thus proceed to recover the simplest possible image S whose self-similarity 
descriptors best approximate the 'combined' descriptors {di}™J[ obtained by 



Eq. fl3.3p . 

Self-similarity descriptors cover large image regions, with high overlaps. 
As such, the similarity and dissimilarity between two image locations (pixels) 
of S are implicitly captured by multiple self-similarity descriptors and in dif- 



ferent descriptor entries. The self-similarity descriptor as defined in Shecht- 



man and Irani (2007) has values in the range [0, 1], where 1 indicates high 
resemblance of the central patch to the patches in the corresponding log- 
polar bin, while indicates high dissimilarity of the central patch to the 
corresponding log-polar bin. For our purposes, we stretch the descriptor 
values to the range [—1,1], where 1 signifies "attraction" and —1 signifies 
"repulsion" between two image locations. 

Let W be a wh x wh matrix capturing the attraction/repulsion between 
every two image locations, as induced by t he collection of the 'combined' 
self-similarity descriptors {di}™J[ of Eq. (3.3). Entry w^ in the matrix is the 



degree of attraction/repulsion between image locations U and lj ) determined 
by the self-similarity descriptors di and dj centered at those points. di(lj) 



is the value of the bin containing location lj in descriptor di (see Fig. 3.7). 
Similarly, dj(lj) is the value of the bin containing location ^ in descriptor dj. 
The entry w^ gets the following value: 



Wij = ctij (di{lj) + dj{li)) /2 (3.4) 
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Fig. 3.7: Computing attraction/repulsion matrix W: The log-polar self- 
similarity descriptor d{ is located at l{ (red cross). White bins signify 
image areas of high similarity to the central patch, dark bins signify image 
areas of dissimilarity to the central patch. The point lj (blue cross), 
which is the center of descriptor dj (not drawn), falls in a white bin of 
descriptor d{ (i.e., < d{{lj) < 1). The entry Wij in the matrix W 
is determined accordingly: = [d{(lj] + dj(li)) /2, where o>ij (the 
certainty assigned to this entry), is inversely proportional to the distance 
II h ~ h II (th e distance between the red and blue crosses). Similarly, the 
point Ik (green cross), which is the center of another descriptor d^ (also 
not drawn), falls in a dark bin of descriptor d{, i.e., — 1 < di(lk) < 0, 
and aik < QLij (because the green cross falls farther away from the center 
of di, hence lower certainty). 

where = ctji is inversely proportional to the distance || l — lj || between the 
two image locations (we give higher weight to bins that are closer to the center 
of the descriptor, since they contain more accurate/reliable information). 

Note that a 'pure' attraction/repulsion matrix W of a true binary image 
S contains only 3 types of values Wif —1, 0, 1. If U and lj belong to the same 
region in S (i.e., both in foreground or both in background), then = 1; if 
li and lj belong to different regions in S, then — — 1, and if the points are 
distant (out of descriptor range), then wij = 0. In the general case, however, 
the entries span the range [—1,1], where 1 stands for "strong" attraction, —1 
for "strong" repulsion and means "don't care". The closer the value of 
to 0, the lower its attraction/repulsion confidence; the closer it is to ±1, the 
higher the attraction/repulsion confidence. 

Note that W is different from the classical affinity matrix used in spectral 
clustering or in min-cut, which use non-negative affinities, and their value is 
ambiguous - it signifies both high- dissimilarity as well as low- confidence. The 
distinction between 'attraction', 'repulsion', and 'low-confidence' is critical 
in our case, thus we cannot resort to the max-flow algorithm or to spectral 
clustering in order to solve our problem. An affinity matrix with positive 



and negative values was used by Yu and Shi (2001) in the context of the 
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(a) 





Fig. 3.8: Detecting and sketching the common: (a) Five input images, (b) 
The resulting sketch. 



normalized-cut functional. However, their functional is not appropriate for 
our problem (and indeed did not yield good results for S when applied to our 
W). We therefore define a different functional and optimization algorithm 
in order to solve for the binary sketch S. 

The binary image S which best approximates the attraction/repulsion 
relations captured by W ', will minimize the following functional: 



mm 

s 



^WijiSih) - S(lj)) 2 subject to S(l) e {-1, 1} (3.5) 



h3 



where S(l) is the value of S at pixel I. Note that for a binary image, the 
term (S(li) — S(lj)) 2 can obtain only one of two values: (if both pixels 
belong to foreground, or both belong to background), or 4 (if one belongs 
to the foreground, and one to the background). Thus, when is positive 
(attraction), S(k) and S(lj) should have the same value (both 1 or both — 1), 
in order to minimize that term Wij(S(li) — S(lj)) 2 . The larger (stronger 
confidence), the stronger the incentive for S{li) and S(lj) to be the same. 
Similarly, a negative (repulsion) pushes apart the values S(li) and S(lj). 
Thus, S(li) and S(lj) should have opposite signs in order to minimize that 
term Wij(S(li) — S(lj)) 2 . When « (low confidence), the value of the 
functional will not be affected by the values S(k) and S(lj) (i.e., "don't 
care"). It can be shown that in the 'ideal' case, i.e., when W is generated 



from a binary image S, the global minimum of Eq. (3.5) is obtained at S. 



Solving the constrained optimization problem: The min-cut problem 
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where only non-negative values of are allowed can be solved by the max- 
flow algorithm in polynomial time. However, the weights in the functional 



of Eq. (3.5) can obtain both positive and negative values, turning our 'cut' 



problem as posed above into an NP-hard problem. We therefore approximate 



Eq. (3.5) by reposing it as a quadratic programming problem, while relaxing 
the binary constraints. 

Let Dbea diagonal matrix with Da = , and let L = D — W be the 

graph Laplacian of W. Then \ "}2ij^ij(S{h) — S(lj)) 2 = S T LS. Thus, our 
objective function is a quadratic expression in terms of S. The set of binary 
constrains are relaxed to the following set of linear constraints —1 < S(l) < 1, 
resulting in the following quadratic programming problem: 

S = argmm S T LS s.t. - 1 < S(l) < 1 (3.6) 

Since L is not necessarily positive semi-definite, we do not have a guarantee 
regarding the approximation quality (i.e., how far is the achieved numerical 
solution from the optimal solution). Still, our empirical tests demonstrate 
good performance of this approximation. We use Matlab's optimization tool- 
box (quadprog) to solve this optimization problem and obtain a sketch S. In 
principle, this does not yield a binary image. However, in practice, the re- 
sulting sketches look very close to binary images, and capture well the rough 
geometric shape of the common objects. 



The above sketching algorithm is quite robust to outliers (see Sec. 3.5), 
and obtains good sketches from very few images. Moreover, if when con- 
structing the attraction/repulsion matrix W we replace the 'combined' de- 
scriptors of Eq. ( |3.3D with the self-similarity descriptors of a single image, 
our algorithm will produce 'binary' sketches of a single image (although these 
may not always be visually meaningful). An example of a sketch obtained 
from a single image (using all its self-similarity descriptors) can be found in 
Fig. [3A 



3.5 Experimental Results 



Figs. |3.H3.3|3.6|3.8|3l)[3.10| show qualitative results on various image sets. 
In all of these examples the number of input images was very small (3 — 7), 
with large variability in appearance and background clutter. Our algorithm 
was able to detect and produce a compact representation (a sketch) of the 
common content. 

We further conducted empirical evaluations of the algorithm using ETHZ 
shape dataset Ferrari et al (2006). This dataset consists of five object cat- 
egories with large variability in appearance: Applelogos, Bottles, Giraffes, 
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Fig. 3.9: Detecting and sketching the common: (a) The input images, (b) 
The resulting sketch. 
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Fig. 3.10: Sample results on ETHZ shapes [Ferrari et aL\ ( |2006[ ) dataset: 

Detection and sketching using only 3 images (left), and using 6 images 
(right). 
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Fig. 3.11: Evaluating sketch 

quality: Mean values of 
Quality (S) as a function 
of the number of input 
images (K — 2,...,10 y ) 
randomly sampled from 
each set of ETHZ shape 
dataset Fer rari et al.l 
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Fig. 3.12: Sketching in presence 
of outliers: We "cor- 
rupt" a set of 10 "inlier" 
with n randomly chosen 
natural images. Graph 
shows mean values of 
Quality (S) as a function 
of the percent of outlier 
images in the input set, 
i.e., nj (10 + n). 
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Fig. 3.13: Detection in new im- 
ages: We empirically 
evaluated how well the 
sketch generated form very 
few images (K — 2, ..10,) 
performs in detecting the 
common shape in new im- 
ages. 



3. Sketching the Common 



46 



Mugs and Swans (example images can be seen in Fig. 3.10). There are 
around 50 images in each set, with ground-truth information regarding the 
location of the object in each image, along with a single hand-drawn ground 
truth shape for each category. In order to assess the quality of our algorithm 
(which is currently not scale invariant, although it can handle up to ±20% 
scale variation, and ±15° rotations), we scaled the images in each dataset to 
have roughly the same object size (but we have not rotated the images, nor 
changed their aspect ratios). 

Sketch quality score: Because our sketch S is continuous in the range 
[— 1, 1], we stretch the values of the ground-truth sketch Sqt also to this 
range, and multiply the two sketches pixel-wise. Our sketch quality score 
is: Quality (S) = < S,Sgt >/(# of pixels). In places where both sketches 
agree in their sign (either white regions or black) the pixel-wise product is 
positive, while in places where the sketches disagree, the product is nega- 
tive. This produces a sketch quality score with values ranging between —1 
(lowest quality) to +1 (highest quality). Note that even if our sketch dis- 
plays a perfect shape, its quality will be smaller than 1, because it is not a 
perfect binary image. From our experience, sketch quality > 0.8 are usually 
excellent-looking sketches. 

We first assessed the quality of our algorithm to identify and sketch the 
common object correctly, as a function of the number of input images K (K = 
2,3, .., 10). We randomly sampled K images out of an object category set, 
applied our detection and sketching algorithm to that subset, and compared 
the resulting sketch S to the ground-truth Sgt> We repeated this experiment 



15 times for each X, and computed mean sketch quality scores. Fig. |3.11 
displays plots of the mean quality score for the 5 categories. It can be seen 
that from relatively few images (K = 3) we already achieve sketches of 
good quality, even for challenging sets such as the giraffes (although, with 
the increased number of example images, its legs tend to disappear from 
the sketch because of their non-rigid deformations). Examples for sketching 
results for some of these experiments can be seen in Fig. |3.10[ 

We next evaluated the robustness of the sketching component of our algo- 
rithm to outliers. Such robustness is important, since the detection algorithm 
often produces outlier detections (see Fig. 3.5). We used 10 "inlier" images 
which alone generate a good sketch with high sketch quality score. We then 
added to them n = 1,...,30 outlier images (cropped at random from nat- 
ural images). For every such 10 + n image set we generated a sketch, and 
compared it to the ground-truth. Each experiment was repeated 15 times. 
Fig. 3.12 displays plots of sketch quality vs. percent of outliers n/(10 + n). 
Our sketching method is relatively robust to outliers, and performs quite well 
even in presence of 50% outliers (as expected due to the median operation 
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in Eq. Q). 

In addition to sketch quality evaluation we tested the performance of 
our algorithm in the scenario described in the Introduction: given a very 
small number of example images, how useful is the output of our automatical 
detection & sketching algorithm for successfully detecting that object in new 
images. For K = 2, 3, ...,10, we randomly sampled K images out of an object 
category set, applied our detection & sketching algorithm to that subset, and 
used the resulting sketch to detect the object in the remaining 50 — K images 
of that category set. We consider an object in image I n as "detected" if the 
location of maxMatch(S, I n ) (the detected center c n of the object) falls no 
farther away than 1 /4 of the width or height of the bounding-box from the 
ground-truth center. We repeated each experiment 40 times and plotted the 
average detection rates in Fig. |3.13| For the Apples, Bottles, and Swans we 
get high detection rates (for as few as K = 3 example images; a scenario 
no WSL method can handle to the best of our knowledge). However, our 
detection rates are not as good in the Giraffe set, since the giraffes undergo 
strong non-rigid deformations (they sometimes tilt their necks down, and 
their legs change positions). Our current algorithm cannot handle such strong 
non-rigid deformations. 



4. NEGATIVE AFFINITIES] 



Clustering is a fundamental task in unsupervised learning. The focus of this 
chapter is the Correlation Clustering (CC) functional which combines posi- 
tive and negative affinities between pairs of data points. In this chapter we 
provide a theoretical analysis of the CC functional. Our analysis suggests 
a probabilistic generative interpretation for the functional, and justifies its 
intrinsic "model-selection" capability. In addition we suggest two new appli- 
cations that utilize the "model-selection" capability of CC: unsupervised face 
identification and interactive multi-object segmentation by rough boundary 
delineation. 

The resulting CC energy is arbitrary and is very difficult to approximate. 
We defer the discussion on our approximate minimization algorithms for the 
CC energy to chapter [7] in part |III[ which deals with approximation schemes 
for arbitrary energies. 



4.1 Introduction 



One of the fundamental tasks in unsupervised learning is clustering: grouping 
data points into coherent clusters. In clustering of data points, two aspects 
of pair- wise affinities can be measured: (i) Attraction (positive affinities), i.e., 
how likely are points i and j to be in the same cluster, and (ii) Repulsion 
(negative affinities), i.e., how likely are points i and j to be in different 
clusters. 

Indeed, new approaches for clustering, recently presented by |Yu and Shi 



(2001 


) and 


Bansal et al. 


(2004) 



information. Normalized cuts was extended by Yu and Shi (2001 ) to allow for 
negative affinities. However, the resulting functional provides sub-optimal 
clustering results in the sense that it may lead to fragmentation of large 
homogeneous clusters. 

The Correlation Clustering functional (CC), proposed by Bansal et al. 



(2004), tries to maximize the intra-cluster agreement (attraction) and the 



inter-cluster disagreement (repulsion). Contrary to many clustering objec- 



1 This is joint work with Meirav Galun 
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tives, the CC functional has an inherent "model-selection" property allowing 



to automatically recover the underlying number of clusters (Demaine and 



Immorlica 



Sec. 4.2 



(2003)). 



focuses on a theoretical probabilistic interpretation of the CC 
functional. The subsequent sections (Sec. 4.3 and 4.4) present two new ap- 
plications. Both these applications build upon integrating attraction and 
repulsion information between large number of points, and require the ro- 
bust recovery of the underlying number of clusters k. 

This chapter focuses on the CC functional, its properties and derived 
applications. We defer to chapter [7] our novel approach to CC optimiza- 
tion. Experimental results presented in this chapter were produced using 
our algorithms, which are described in more detail in chapter [7j 



Correlation Clustering (CC) Functional 

Let W G ]R nxn be an affinity matrix combining attraction and repulsion: for 
Wij > we say that i and j attract each other with certainty |W#|, and for 
Wij < we say that i and j repel each other with certainty \Wij\. Thus the 
sign of tells us if the points attract or repel each other and the magnitude 
of W^ indicates our certainty. 

Any fc-way partition of n points can be written as U G {0, l} nx s.t. 
Ui C = 1 iff point i belongs to cluster c. Ui C — 1 Vi ensure that every i 
belongs to exactly one cluster. 



The CC functional maximizes the intra-cluster agreement (Bansal et al. 
(2004)). Given a matrix W^J an optimal partition U minimizes: 



Scc(U) = x>oE r << r '< c 4 - 1 ) 

ij c 

s.t. u ic e{o,i}, J2 u ic = 1 

c 

Note that Ui c Uj c equals 1 iff i and j belong to the same cluster. For 
brevity, we will denote J2 C Ui c Uj c by \UU T ^.. from here on. 



4.2 Probabilistic Interpretation 

This section provides a probabilistic interpretation for the CC functional. 
This interpretation allows us to provide a theoretic justification for the "model 

2 Note that W may be sparse. The "missing" entries are simply assigned "zero certainty" 
and therefore they do not affect the optimization. 
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selection" property of the CC functional. Moreover, our analysis exposes the 
underlying implicit prior that this functional assumes. 

We consider the following probabilistic generative model for matrix W. 
Let U be the true unobserved partition of n points into clusters. Assume 
that for some pairs of points z,j we observe their pairwise similarity values 
Sij. These values are random realizations from either a distribution / + or 
/~, depending on whether points z, j are in the same cluster or not. Namely, 



p\s 



p\s 



[uu T ] 



13 



13 



f + (s) 

r 00 



Assuming independency of the pairs, the likelihood of observing similar- 
ities {s^} given a partition U is then 



£({s lJ }\u) = ]jr(sJ uu %.r 



r i-\uu T ] 



J i3 > 



To infer a partition U using this generative model we look at the posterior 
distribution: 

Pr(U\{s tJ })cx£({s tJ }\U)-Pr(U) 

where Pr (U) is a prior. Assuming a uniform prior over all partitions, i.e., 
Pr (U) = const, yields: 



Pr(U\{ Slj })^l[f + ( Sij ) 



uu 1 



r 



J i3 J 



[l-[UUT] 



Then, the negative logarithm of the posterior is given by 

-lDgPr(^|{ a y}) = C + J]log/ + (^)[[/t/ T ] M . 

ij 



J] log/" ( Sij ) (l-[UU T ] 



13 



13 



where C is a constant not depending on U. 

Interpreting the affinities as log odds ratios In- 
terior becomes 



log 



f~( s ij) 



log Pr(U\{ SlJ }) = C-^^j[UU t ] 



13 



the pos- 
(4.2) 



That is, Eq. (4.2) estimates the log-posterior of a partition U . Therefore, a 
partition U that minimizes Eq. (4.2) is the MAP (maximum a-posteriori) 
partition. Since Eq. (4.1) and Eq. (4.2) differ only by a constant they share 
the same minimizer: the MAP partition. 
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Fig. 4.1: Prior on the number of clusters k: Graph shows — log Pr (k) , for 

uniformly distributed U . The induced prior on k takes a non-trivial shape: 
it assigns very low probability to the trivial solutions of k — 1 and k — n, 
while at the same time gives preference to partitions with non-trivial k. 
The mode of this prior is when U has roughly t-^- clusters. 



4.2.1 Recovering k (a.k.a. "model selection") 

We showed that the generative model underlying the CC functional has a 
single model for all partitions, regardless of k. Therefore, optimizing the CC 
functional one need not select between different generative models to decide 
on the optimal k. Comparing partitions with different k is therefore straight 
forward and does not require an additional "model complexity" term (such 
as BIC, MDL, etc.) 

As described in the previous section the CC functional assumes a uniform 
prior over all partitions. This uniform prior on U induces a prior on the 
number of clusters fc, i.e., what is the a-priori probability of U having k 
clusters: Pr (k) = Pr (U has k clusters). We use Stirling numbers of the 
second kind (Rennie and Dobson ( |1969[ )) to compute this induced prior on 



k. Fig 4.1 shows the non-trivial shape of this induced prior on the number 



of clusters k. 



4.3 Interactive multi-object segmentation (Patent Pending^ 

Negative affinities in image segmentation may come very naturally from 
boundary information: pixels on the same side of a boundary are likely to be 



3 This work was published in the 3 rd International Conference on Information Science 
and Applications (ICISA), 12012 
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(a) Input image and boundary 
scribbles (red) 



(b) Resulting segmentation 



Fig. 4.2: Interactive multi-object segmentation: (a) The user provides only 
crude and partial indications to the locations of boundaries between the 
relevant objects in an image (red), (b) The output of our algorithm cor- 
rectly segments the image into multiple segments. Image was taken from 
Alpert et al.| pOOfy 




Fig. 4.3: Prom boundary scribbles to affinity matrix: (a) A boundary scrib- 
ble is drawn by the user (red), inducing "figure/ ground" regions on its 
opposite sides (black and white regions), (b) For each scribble we use the 



method of Levin et al. \200<y) to generate a soft segmentation of the image 
into two segments: pixel values in the soft segmentation are in the range 
[—1, 1]. Pixels far away from the scribble are assigned as it is uncertain 
to what segment they should belong to. Each pixel i is described using 
a segmentation membership vector v\ with an entry corresponding to its 
assignment at each soft segmentation (red columns), (c) A non-zero en- 
try Wij in the sparse affinity matrix is the correlation between normalized 
vectors V{ and vy. = vfvj/ \\vi\\ ■ \\vj\\. We also add strong repulsion 
across each scribble. 
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in the same segment (attraction), while pixels on opposite sides of a bound- 
ary are likely to be in different segments (repulsion) . We use this observation 
to design a novel approach to interactive multi-object image segmentation. 



Instead of using k different "strokes" for the different objects (e.g., Sant- 



ner et a/.| ( |201l |)), the user applies a single "brush" to indicate parts of the 



boundaries between the different objects. Using these sparse and incomplete 
boundary hints we can correctly complete the boundaries and extract the de- 
sired number of segments. Although the user does not provide at any stage 
the number of objects fc, our method is able to automatically detect the num- 
ber of segments using only the incomplete boundary cues. Fig. |4.2| provides 
an example of our novel interactive multi-object segmentation approach. 
Computing affinities: Fig. |4.3| illustrates how we use sporadic user-provided 
boundary cues to compute a sparse affinity matrix with both positive and 
negative entries. Note that this is a modification of the affinity computation 
presented by Stein et al. ( 2008[ ): (i) We use the interactive boundary cues to 
drive the computation, rather than some boundaries computed by unsuper- 
vised technique, (ii) We only compute a small fraction of all entries of the 
matrix, as opposed to the full matrix of Stein et a/, (hi) Most importantly, we 
end up with both positive and negative affinities in contrast to Stein et al 
who use only positive affinities. 

The sparse affinity matrix W is very large (~ 100/c x lOOfc). Existing 
methods for optimizing the correlation clustering functional are unable to 
handle this size of a matrix. Chapter [7] describes in detail our novel approach 
to CC optimization that enables us to optimize such large scale problems. 
We applied our Swap-and- Explore algorithm (Alg. [2j described in 7.4) to 
this problem and it provides good looking results with only several minutes 
of processing per image. 

Fig. |4.4| shows input images and user marked boundary cues used for 
computing the affinity matrix. Our results are shown at the bottom row. 

The new interface allows the user to segment the image into several co- 
herent segments without changing brushes and without explicitly enumerate 
the number of desired segments to the algorithm. 



4.4 Clustering and face identification 

This application shows that detecting the underlying number of clusters k 
can be an important task on its own. Given a collection of face images we 
expect the different clusters to correspond to different persons. Identifying 
the different people requires not only high purity of the resulting clusters 
but more importantly to correctly discover the appropriate number of clus- 
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Fig. 4.4: Interactive segmentation results. Input image and user boundary 



cues (top), our result (bottom). Images were taken from \Alpert et al. 



ters. This experiment is an extension of existing work on the problem of 
"same/not-same" learning. Following recent metric learning approach (e.g., 
Guillaumin et al. (2009, 2010| )) we learn a single classifier that assigns a 
probability to each pair of faces: "how likely is this pair to be of the same 
person". Then, using this classifier, we are able to determine the number 
of persons and cluster the faces of unseen people. That is, given a new set 
of face images of several unseen people, our clustering approach is able to 
automatically cluster and identify how many different people are in the new 
set of face images of never seen before people. 



For this experiment we use PUT face dataset Kasinski et al. (2008). The 
dataset consists of 9971 images of 100 people (roughly 100 images per per- 
son). Images were taken in partially controlled illumination conditions over 
a uniform background. The main sources of face appearance variations are 
changes in head pose, and facial expression. 



We use the same method as Guillaumin et al. (2009) to describe each 



face. SIFT descriptors are computed at fixed points on the face at multiple 
scales. We use the annotations provided in the dataset to generate these 



keypoints. Given a training set of labeled faces {a^, Vi}f =1 we use a state-of- 



the-art method by Guillaumin et al. (2010) to learn a Mahalanobis distance 
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L and threshold b such that: 

Pr (y { = yj\x u Xj] L,b) = a (b - (x { - Xj) T L T L (x { - Xj^j 

where a(z) = (1 — e~ z )~ 1 is the sigmoid function. 

For each experiment we chose k people for test (roughly 100 • k images), 
and used the images of the other 100 — k people for training. The learned 
distance is then used to compute py, the probability that faces i and j be- 
long to the same person, for all pairs of face images of the k people in the 
test set. The affinities are set to Wij = log • We apply our clustering 
algorithm to search for an optimal partition, and report the identified num- 
ber of people k' and the purity of the resulting clusters. We experimented 
with k — 15, 20, . . . , 35. For each k we repeated the experiments for several 
different choices of k different persons. 

In these settings all our algorithms, described in Sec. |7.4[ performed 
roughly the same in terms of recovering k and the purity of the resulting clus- 
tering. However, in terms of running time adaptive-label ICM completed the 
task significantly faster than other methods. We compare Swap- and- Explore 



(Alg. [2] of Sec. 7.4) to two different approaches: (i) Connected components: 
Looking at the matrix of probabilities p^-, thresholding it induces k f con- 
nected components. Each such component should correspond to a different 
person. At each experiment we tried 10 threshold values and reported the 
best result, (ii) Spectral gap: Treating the probabilities matrix as a positive 



affinity matrix we use NCuts Shi and Malik (2000) to cluster the faces. For 



this method the number of clusters k' is determined according to the spectral 
gap: Let be the i th largest eigenvalue of the normalized Laplacian matrix, 
the number of clusters is then k f arg max A/ 



1 K+i 



Fig. |4.5| shows cluster purity and the number of different persons k' iden- 
tified as a function of the actual number of people k for the different meth- 
ods. Our method succeeds to identify roughly the correct number of people 
(dashed black line) for all sizes of test sets, and maintain relatively high 
purity values. 



4.5 Conclusion 

This chapter provides a generative probabilistic interpretation for the Cor- 
relation Clustering functional, justifying its intrinsic "model selection" ca- 
pability. Using a generative probabilistic formulation allows for a better 
understanding of the functional, underlying assumptions it makes, including 
the prior it imposes on the solution. 



4. Negative Affinities 



56 



o 



o 



CD 

120 

CD 

CD 
> 
O 

o 10 

CD 



1 1 1 


T 


1 1 1 





15 20 25 30 35 
Number of different people 



0.9 



>>0 . 6 
0.3 






15 20 25 30 35 
Number of different people 



Fig. 4.5: Face identification: Graphs showing our result (Swap), spectral and 

connected components. Left: recovered number of people (k f ) vs. number 
of people in the test set. Dashed line shows the true number of people. 
Right: purity of resulting clusters. 

Optimizing large scale CC and robustly recovering the underlying number 
of clusters allows us to propose new applications: interactive multi-label 
image segmentation and unsupervised face identification. 



5. 3D SHAPE RECONSTRUCTION BY COMBINING 
MOTION AND LIGHTING CUES0 



In this chapter we consider the problem of reconstructing the 3D shape of a 
moving object while accounting for the change of intensities due to a change 
in orientation with respect to the light sources. We assume that both the 
lighting and motion parameters are given. Two methods are presented. First, 
for lambertian objects illuminated by a point source we derive a PDE that 
is quasilinear and implicit in the surface shape. We propose to solve this 
PDE by continuation (characteristic curves), extending an existing method 
to allow for large motion. Secondly, we formulate the reconstruction problem 
as an MRF and solve it using discrete optimization techniques. The latter 
method works with fairly general reflectance functions and can be applied 
to sequences of two or more images. It can also incorporate prior informa- 
tion and boundary conditions. We further discuss a method for extracting 
boundary conditions. We demonstrate the performance of our algorithms 
by showing reconstructions of smooth shaped objects and comparing these 
reconstructions to reconstructions with laser scans. 



5.1 Introduction 

Moving objects change their appearance as they change their orientation 
with respect to a light source. Such changes make it difficult to identify 
corresponding points across images, and as a result complicate the task of 
motion recovery and subsequently also of 3D reconstruction. Existing ap- 
proaches to motion analysis largely assume either brightness constancy, or 
rely on extracting distinctive features. These approaches successfully han- 
dle both polygonal shapes and objects with noticeable surface markings, but 
they are less suitable for handling objects with smooth shapes and gradual 
albedo variations. 

In this paper we propose a framework for reconstructing the shape of 
such objects that explicitly takes lighting effects into account. Our work is 
based on the observation that the changes in both the location and intensity 

1 This is joint work with Meirav Galun and Ronen Basri 
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of a moving object across images are coupled by its motion. Therefore, these 
two sources of information can be combined to constrain the matching of 
points across images, allowing us to achieve a veridical reconstruction. A 
number of previous studies (see Section 5.2) used this observation to propose 
algorithms for shape recovery either in multi-view settings or in restricted 



settings (single directional light source and small motion, Basri and Frolova 



( 2008[ )) when only two views are available. In this study we propose a method 
that can work with two or more images and that can deal with general lighting 
conditions and fairly large motion. 

We consider the case of an object that moves rigidly with respect to the 
camera and the light source. We assume that the object is lambertian, and 
that both the lighting conditions and the motion parameters are known. We 
present two algorithms. First we address the problem of shape reconstruc- 
tion from two views when lighting is composed of a single directional source, 
and derive a partial differential equation (PDE) that can be solved to recover 
the shape of the object using continuation (characteristic curve). We further 
show how we can derive boundary conditions for this method. This formu- 
lation extends the work of Basri and Frolova (2008) to objects undergoing a 
large motion. 

Our second algorithm uses the PDE formulation to construct a cost func- 
tion on a graph, representing the domain to be reconstructed. The cost 
function takes the form of a Markov Random Field (MRF). We then use 
off-the-shelf algorithms to solve for the sought shape. The MRF formulation 
offers several advantages over previous work. (1) It can work with fairly gen- 
eral reflectance functions. (2) It can be applied to pairs of images as well as 
to sequences of three or more images. (3) It is more robust to errors than 
continuation solutions, which can accumulate errors in integration. Finally, 
(4) prior information can be incorporated; in particular, the method can use 
boundary conditions, but as we show in our experiments boundary condi- 
tions are not essential. Experiments with real smooth objects demonstrate 
the utility of our formulation. 

The paper is divided as follows. Section [572] provides a brief summary of 
related work. Section |5.3| defines the reconstruction problem and derives a 
continuation solution in the case of large motion. Section [574] casts the prob- 
lem as an MRF optimization and discusses its solution by discrete techniques. 



Section |5.5| explains how we compute boundary conditions, and Section |5.6 
shows the results of our experiments. 
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5.2 Previous work 



The majority of 3D reconstruction techniques use either motion or shad 
ing techniques, but rarely combine both cues. Shape from shading Horn 



and Brooks (1989) and photometric stereo Woodham (1980) utilize shading 
cues to reconstruct objects from single or multiple images of static objects. 
Motion is often handled with the assumption of brightness constancy |Horn| 



and Schunk (1981); Lucas and Kanade (19811) or by matching sparse feature 



points (e.g., 



Cryer et al. 


(1995 


); 


Fua 



and Leclerc (1995). Note that in a stereo setup lambertian objects retain 



their intensities across images. Another set of studies seek to generalize the 



brightness constancy assumption to account for local lighting variations Black 



require a multi- frame setting (typically at least 4 images) . 



et al. (200C 


)); Negahdaripour (1998); 


Haussecker and Fleet 


(2001). 


Studies that use motion and shading cues simultaneously Carceroni and 


Kutulakos 


(2001); Jin et al. (2008|); Maki et al. (2002); Mukawa ( 


1990); 


Simakov et al. ( 


2003); Weber et al. 


(2004); Zhang et al.\ 


2003); Jos! 


li and 


Kriegman (2007 


); Chen et al. (|2009|); Moses and Shimshoni 


(2006). typically 



'hese studies also 



estimate the light source direction along with the shape of the object. An 



exception is Basri and Frolova (2008), which requires only two images, but 



assumes small motion. Our work improves over these methods by proposing 
reconstruction methods that can handle large motion and can work both with 
image pairs and with larger sequences of images. In addition, our second al- 
gorithm allows for general lighting settings, which include point and extended 



light sources, through the use of spherical harmonic representations (Basri 
and Jacobs (2003); R.Ramamoorthi and P.Hanrahan ( 200ip ). 



5.3 Problem definition and solution by continuation 

We consider a pair of images taken by a stationary camera. The images depict 
a lambertian object moving rigidly in space and illuminated by constant 
lighting. Denote these images by /(x,?/) and J(x,y). Let P = (x,?/, z{x,y)) 
be a point on the object's surface described in the coordinate frame of /, 
and let p and q denote its projections onto I and J. Assuming a weak 
perspective projection p = (x, y), and q = sRP + t in J, where the scale 
s > 0, rotation R (represented by a 2 x 3 matrix with orthonormal rows) and 
translation t G 3? 2 describe the (known) motion of the object. In general, 
since the motion is assumed to be known the location of q depends on the 
unknown depth value of P, z. We therefore often emphasize this dependence 
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by writing q(z). 

Denote the normal to the surface at p in / by n G 5 2 , 

A I {~ z xi ~ z yi I)? 

Y ^ T T 1 

where z x = dz/dx, z y = dz/dy, and denote the albedo at p by p G 3ft. Thus, 
the normal at q(z) is given by Rn. We assume that the intensity of P in 
the two images can be expressed as /(p) = pr(n) and J(q) = pr(i?n), where 
r : 5 2 — >> 3?, commonly referred to as the reflectance function, is a (known) 
function of the surface normal. This expression can be used to model a vari- 
ety of materials. In particular, it can model lambertian objects illuminated 
by a directional ("point") source by setting r(n) = max(l T n,0). Here the 
direction of the point source is given by 1/||1|| and its intensity by ||1||. Like- 
wise, under more general lighting conditions (that may include multiple point 
and extended sources) r(n) can be expressed as an inner product between a 
vector of coefficients b G K d and the spherical harmonic functions evaluated 
at n, denoted Y(p) G d is set to either 4 or 9 if respectively the first or 



second order harmonic approximation is used, see Basri and Jacobs (2003) 
for details of this model. 

Under these assumptions we can eliminate the albedo p by taking the 
ratio between the intensities of corresponding points, namely 

J(q(*)) _ r(Rn) 
/(p) r(n) ' 1 • ' 

This PDE captures the relation between the shape of the moving object, its 
motion and the environment lighting. The shape of the object is captured 
both implicitly by the correspondence between p and q(z) (in other words, 
this equation is implicit in z\ and explicitly by the surface normal (i.e., the 
partial derivatives of z). 

For a lambertian object illuminated by a single directional light source 



1 Eq. (5.1) can be written as follows. Let n = {—z X) —z y) 1) T (so that n = 
n/||n||) then 

Note that in this equation the nonlinear term ||n|| cancels out. Rearranging 
this equation we obtain 

a T n = 0, (5.3) 

where a(x,T/, 2) = J(q(z))l — I(p)R T l. The equation is quasi-linear (i.e., 
linear in the derivatives z x and z y ) ) although implicit in z 
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Basri and Frolova] ( |2008[ ) made this equation explicit in z by using a Taylor 



approximation for J(q(z)) and used the obtained expression to recover the 
shape of Lambertian objects illuminated by a single directional source. As 



their method approximates (5.1) to first order, it can handle only objects 



undergoing very small motion. Specifically, assuming / and J are rectified 



such that their epipolar lines are horizontal, they showed that (5.1) can be 
written as an explicit, quasilinear PDE 

az x + bz y = c, (5.4) 

where 

a(x,y,z) = h(I - zJ x ) - l 3 I 

b(x,y,z) = l 2 (Io-zJ x ) 

c(x, y, z) = l s (I e - zJ x ) + IJ. 

In this equation J x = dJ/dx, Iq = {J — I)/ 9 and 9 denotes the angle of rota- 



tion about the vertical axis. They further described a solution to (5.4) using 
the method of continuation (characteristic curves) and showed a method to 
extract Dirichlet boundary conditions along the bounding contour of an ob- 
ject. 

Our first contribution is to note that Basri and Frolova's algorithm can 



be extended also to handle large motion. This is because (5.3) is quasi-linear 



even without the Taylor approximation, and so it too can be solved by the 



method of continuation. Note that, due to ( |5.3| ), a lies on the tangent to the 
surface z(x,y) at p. Therefore, any curve 7(f) C 3? 3 whose tangent at every 



t e [0,1] is given by a(7(t)) is characteristic to (5.3), and if 7(0) happens 
to lie on z{x,y) the entire curve will lie on z{x,y). To recover z{x,y) the 
continuation method traces a family of such curves {j(t)} by integrating 
the vectors a(7(t)) starting from a ID set of 3D points given as Dirichlet 



boundary conditions. Unfortunately, as (5.3) is implicit in z, extracting 



boundary conditions can be complicated; we suggest a method to do this in 
Section 15.51 

Finally, we note that the characteristic curves traced with this procedure 
are all plane curves that lie in parallel planes. This can be readily seen by 
noticing that a is a linear combination of 1 and and so a T (l x Rl) = 0. 
in general, unless 1 points in the direction of the axis of rotation (denoted 
u) these parallel planes are perpendicular to u and are independent of 1. 
Note however that generally the planes that contain the characteristics do 
not coincide with the epipolar planes except when the axis of rotation is 



parallel to the image plane. In the case that 1 coincides with u, (5.2) becomes 
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Fig. 5.1: Reconstructions from rendered images of a toy hippo, using continuation. 

The left column shows the original images (rotation of 4°), the middle col- 
umn shows the 3D model and the right column shows the reconstructed 
surface. The top row shows the surface (colormap represents depth val- 
ues) and the bottom row shows the surface painted with intensity values. 



degenerate since its right hand side (substituting u for 1) is 1. Consequently, 
^(q(^)) = an d methods that assume constant brightness can be applied. 
Figure |5.1| shows a reconstruction obtained with our suggested method for 
continuation. A toy model was rendered using point source light and rotated 
by angle of about 4°. The continuation method was applied with exact 
boundary condition values. 



5.4 Discrete optimization 
The continuation method suffers from several shortcomings. First, the method 



accumulates errors in the integration. Moreover, (5.1) is quasi-linear when 
the lighting is composed of a single directional light source, but is non-linear 
when more realistic lighting models are used. Continuation can in principle 
be applied also for non-linear equations, but it is then significantly less ro- 
bust. In addition, the method relies on boundary conditions that are difficult 
to compute, and at the same time does not allow for the inclusion of other 
prior information. Finally, it does not provide a way to integrate information 
from more than two images when more images of the object are available. 
To overcome all of these shortcomings we cast the problem as an MRF 
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optimization and solve it using discrete optimization techniques. Our objec- 



tive is to find a valid surface z(x,y) that satisfies (5.1). We therefore define a 
cost function composed of unary and binary terms. Our MRF is defined over 
a grid overlaid on the first image I(x,y). Each grid point (x, y) is associated 
with a state vector determining the parameters of the surface element (surf el) 
observed at that pixel. These parameters include both the depth value and 
the surface normal, i.e., (z, z X) z y ) ) of the surfel. Note that such state vectors 
cannot represent points along the silhouette contour as the derivatives of z 
at these points diverge. Our cost function combines unary and binary terms 



over the grid points. The unary term is a function of the residual of (5.1). 
Specifically, let 

T = I(p)r(Rh) - J(q)r(n). (5.5) 
We define the unary term as 

El = exp(aT 2 ) - 1, (5.6) 

where a is a constant (a = 8 in our experiments). The binary term is set to 
prefer integrable surfaces. For p = (x,y) we use 

E b v = (z(x ) y)+z x (x ) y)-z(x+l ) y)f + (z(x ) y)+Zy(x ) y)-z(x ) y+l)) 2 . (5.7) 

Our final energy function is given by 

where Qj denotes the silhouette of the object in I and A is constant. (We 
used values around 0.001.) 

In general, when two images are used for reconstruction we need to supply 



boundary conditions in the form of y) along a ID curve 70 . In Section 5.5 
we describe a technique to estimate the depths near the bounding contours 
of the object. This procedure provides both the depth values and the nor- 
mals along a ID contour in Qj. Given these boundary conditions we simply 
modify the unary cost along this contour to vanish in states (z ) z X) z y ) that 
coincide with the depth values and normals predicted by the boundary condi- 
tions. As the extraction of boundary conditions can be noisy we then let the 
optimization modify these depth values. When the optimization is applied 
to three or more images the additional images further constrain the solution, 
and so boundary conditions are not necessary. 



The binary term is clearly non-submodular. We therefore optimize (5.8) 



by a sequence of alpha expansion iterations implemented with the QBPO 
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algorithm Rother et al (2007). Each iteration solves a binary decision prob- 
lem in which each variable is allowed to either change its current state to a 
fixed state (z a ,z^,2^) or retain its current state. For submodular energies 



such an iteration is implemented using the st-mincut algorithm Kolmogorov 



and Zabih (2004); Boykov et al (2001). The QPBO algorithm extends the 



st-mincut algorithm to handle negative edge capacities arising in the case 
of non-submodular pairwise terms. The algorithm constructs a graph with 
two complimentary vertices for each variable, explicitly representing its two 
possible binary states. By using a redundant variable representation, the 
resulting graph can be constructed to have positive capacities in all of its 
edges, and therefore the st-mincut algorithm can be applied to it. The fol- 
lowing rule is used to label the variables at the end of the st-mincut step. 
Each variable for which its two complimentary vertices fall in different sides 
of the cut is labeled according to the assignment induced by the cut. Vari- 
ables for which both complimentary vertices fall in the same side of the cut 
remain unlabeled. As minimizing non-submodular energies generally is NP- 
hard, QBPO is not guaranteed to label all the variables after each st-mincut 
step, and some of the variables may remain unlabeled. These unlabeled vari- 
ables retain their original label from the previous iteration. This procedure 
is repeated for every possible choice of a label z£). The entire batch 

of iterations is then repeated until convergence. The optimization is initial- 
ized by setting the initial state (z,z x ,z y ) for every point to the state that 
minimizes the unary term (5.6). 

The discrete optimization solver is followed by a continuous quadratic 
optimization, in order to relax the quantized labeling (discrete values) of 
the surface. We solve the following quadratic optimization functional, main- 
taining the integrability constraint while remaining close to the quantized 
solution: 

ecz = y) - ^ v)f + (5-9) 

where z(x, y) is the solution obtained by the discrete optimization, and/x = 5 
in our experiments. 

Finally, note that this procedure can readily be applied also when the 
lighting changes between images, by using different reflectance functions for 
I and J in (5.5). 



5.5 Boundary conditions 

Using the differential relation ( |5.1[ ) to reconstruct a shape from two images 
requires boundary conditions in the form of depth values - one depth value is 
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required along each characteristic curve. This is essential for the continuation 
method and can be useful for the discrete optimization framework to further 
constrain the solution. Obtaining depth values can be complicated, as they 
require finding the correspondence between pixels on a smooth objects. This 



was bypassed in Basri and Frolova (2008) since under a small motion and with 



the Taylor expansion (5.4) becomes explicit in z, allowing one to compute 
the depths at the silhouette contours of the object from the intensities along 
the contour. Below we describe a method that computes depth values near 
the silhouette contours when z is implicit. 

Denote the object silhouette in I and J respectively by Qj and Qj. We 
further denote by d v Qi the portion of the silhouette contour in I that remains 
visible in J and likewise by d v Qj the portion of the silhouette contour in J 
that remains visible in /. Let P be a rim point in the coordinate frame of 
I such that its projection onto /, denoted p, lies on d v Qi. Since p lies on 
the bounding contour of Qj the surface normal at P should be parallel to 
the normal to the occluding contour at p, and so it can be derived from the 
image. Let n(p) = (cos /3, sin /?, 0) T . Plugging this normal in ( |5.lD gives us a 
scalar for the right hand side, and using the known value /(p) we compute 
the anticipated intensity at the corresponding point J(q). We can then use 
this intensity to search along the epipolar line on J to determine q. This 
for itself can be complicated, since there may be multiple points in along the 
epipolar line in J with the anticipated intensity, but in addition we face two 
problems: (1) the intensity values along the bounding contour are unreliable, 
and (2) neither the continuation nor our discrete optimization scheme can 
use normals that are parallel to the image plane. 

To overcome these problems we move from p one pixel in the direction 
of the normal into a point p' inside Qj. We assume that the surface normal 
at p' can be expressed by n(p') = (cos/Jcos^sin/Jcos^sin?/;) 11 for some 
unknown angle ip. Our aim then is to find a point q' in J along the epipolar 



line of p', denoted e(pQ, that satisfies (5.1) with this normal. This equation 
has two unknowns, the corresponding point q' in Image J and ip. We there- 
fore regularize the problem by requiring the curve near d v Qi to be mapped 
to a continuous curve in J. 

We define the following optimization function. 



ain V r(p^,q^,^) 2 + /x 1 ||q^-q, || 2 + /x 2 dist(e(p^),q;)^ 

A*3||q< - Qi-ill 2 + /^(cos^ - cos^_i) 2 , 



mm 



where T(.) denotes the residual defined in (5.5) for a pair of points p^ and 
q^ and normal n(p') parameterized by ^, and dist(e(p£), q£) denotes the 



5. 3D Shape Reconstruction by Combining Motion and Lighting Cues 66 



Euclidean distance between the point and the epipolar line e(pj). For 
q° we use the point along e(p$) closes to the boundary (at the same side 



as p) that has the intensity anticipated from (5.1). This heuristic is useful 
with reasonable rotations. To solve this minimization we substitute for J(q f i ) 
in T(q^^) ( |5.5 ) its Taylor expansion around q°. The resulting system of 
equations is non-linear and we solve it by alternate minimization. 

We repeat this procedure for d v £lj to obtain boundary conditions from 



both sides of the object's silhouette. Figure [572] shows an example for bound- 
ary conditions obtained with our method. 




Fig. 5.2: The figure shows boundary conditions computed for two images of a 
bear toy. The red curve segments depict boundary points p' in the left 
image and their selected corresponding points in the right image. The 
yellow curve segments depict correspondences computed in the opposite 
direction. 



5.6 Experiments 

We tested our algorithm on two sets of real images and compared them to 
reconstructions with laser scans. Each image is taken with dark background 
to allow segmentation of the foreground object. In each image we estimated 
the motion parameters by manually marking points on the image and the 
3D mesh object. As the objects are smooth and contain almost no clear 
surface markings our motion estimates are not perfect. Subsequently, using 
the mesh we estimated the environment lighting conditions by calculating 
the 4 coefficients of the first order harmonic representation of the lighting, 
representing the ambient lighting and a directional source. The obtained mo- 
tion and lighting parameters were used as input to our software. The figures 
below show reconstructions obtained with our optimization algorithm (Sec- 
tion 5.4). We measure the quality of the matches obtained using the RMSE 
error in pixels. The figures show input images of a bear and elephant toys, 
reconstructions from pairs of images with and without boundary conditions 
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Fig. 5.3: Input images of a bear toy. A laser scan is shown at the center, and 
relative viewing angles are provided. 



(Section 5.5), and reconstructions from multiple images. Our experiments 
cover a range of rotations between 4° to about 21°. It can be seen that for 
the tested our reconstructions achieve low RMSE values. Interestingly, there 
is no noticeable advantage to using boundary conditions. 



5. 7 Conclusion 



We described methods for reconstructing the shape of a moving object that 
accounts for intensities changes due to a change in orientation with respect 
to the light sources. In particular, we presented a continuation method and 
a method based on discrete optimization. Our experiments demonstrates 
the utility of our methods. The setting requires knowledge of the motion 
and lighting parameters. We plan in the future to seek ways to relax those 
limiting assumptions. 
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Fig. 5.4: Reconstructions from image pairs. Each box shows from left to right the 
laser scan, reconstruction with and without boundary conditions. The 
top row in each box shows the surface (colormap represents depth values) 
and the bottom row shows the surface painted with intensity values. Re- 
constructions are shown for images e and g (9.3°) both achieving RMSE 
of 1.30 pixels (top left box), images d and f (10.1°) both achieving RMSE 
of 1.52 pixels (top right), a and b (12.1°) achieving 2.24 and 2.20 pixels, 
and c and e (16.3°) achieving 3.89 and 3.87. 




c d 



Fig. 5.5: Input images of a elephant toy. A laser scan is shown at the center, and 
relative viewing angles are provided. 
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Fig. 5.6: Reconstructions from image pairs. Each box shows from left to right 
the laser scan, reconstruction with and without boundary conditions. 
The top row in each box shows the surface (colormap represents depth 
values) and the bottom row shows the surface painted with intensity 
values. Reconstructions are shown for images f and b (7.5°) achieving 
RMSE 2.14 and 2.12 (top left box) and images b and f (7.5°) achieving 
RMSE 1.93 and 1.94 (top right box) 
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Fig. 5.7: Reconstruction from multiple images. Ground truth is shown at left, 
image labels at the top and RMSE (over all images) at the bottom. 
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Fig. 5.8: Reconstruction from multiple images. Ground truth is shown at left, 
image labels at the top and RMSE (over all images) at the bottom. 
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Fig. 5.9: Reconstruction from multiple images. Ground truth is shown at left, 
image labels at the top and RMSE (over all images) at the bottom. 



6. LEARNING DISCRETE ENERGIES] 



This chapter introduces a new formulation for discrete image labeling tasks, 
the Decision Tree Field (DTF), that combines and generalizes random forests 
and conditional random fields (CRF) which have been widely used in com- 
puter vision. In a typical CRF model the unary potentials are derived from 
sophisticated random forest or boosting based classifiers, however, the pair- 
wise potentials are assumed to (1) have a simple parametric form with a 
pre-specified and fixed dependence on the image data, and (2) to be defined 
on the basis of a small and fixed neighborhood. In contrast, in DTF, local 
interactions between multiple variables are determined by means of decision 
trees evaluated on the image data, allowing the interactions to be adapted to 
the image content. This results in powerful graphical models which are able 
to represent complex label structure. Our key technical contribution is to 
show that the DTF model can be trained efficiently and jointly using a con- 
vex approximate likelihood function, enabling us to learn over a million free 
model parameters. We show experimentally that for applications which have 
a rich and complex label structure, our model outperforms state-of-the-art 
approaches. 

6.1 Introduction 



The last decade has seen the meteoric rise in the use of random field models in 



computer vision Szeliski et al (2008). Random fields have been used to model 



many problems including foreground-background (fg-bg) segmentation Blake 



et al (2004); 



Shotton et al 



Boykov et al (2001), semantic segmentation He et al 



(2007); Winn and Shotton (2006), stereo Boykov et al 



optical flow Baker et al (2007); Roth and Black (2009); Horn and Schunk 



(1981), and 3D reconstruction Snow et al (2000); Vogiatzis et al (2005). 



(2004); 



(2001 



Many of these problems can be cast as an image labeling problem, where we 
are given an image x and need to predict labels y. Random fields provide a 
way of factorizing the joint distribution p{x, y) or the posterior distribution 



1 This is joint work with Sebastian Nowozin, Carsten Rother, Toby Sharp, Bangpeng 
Yao and Pushmeet Kohli. It was published in the 13^ International Conference on Com- 



puter Vision (ICCV), 2011 
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p(y\x) into a product of local interactions. 

In the classic Markov random field (MRF) we obtain the posterior distri- 
bution p(y\x) by integrating a per-pixel likelihood functions with pairwise 
consistency potentials ensuring a smooth solution Geman and Geman (1984); 



Li ( 1995). One major advance in the field was to make these smoothness cost 



dependent on the local image structure Boykov et al. (2001), conditioning 
parts of the model on the input data. In the last decade, these conditional 
random field (CRF) models iLafferty et al. (2001); Sutton and McCallum 



(2007); He et al. ( |2004D have become popular for their improved ability to 
capture the relationship between labels and the image. 

A lot of research effort has been devoted at the development of efficient 
algorithms for estimating the Maximum a Posteriori (MAP) solution of such 



models 


Felzenszwalb and Huttenlocher (2006); 


Szeliski et al. 


( 


2008 


); 


Kol- 


mogorov (2006); 


Komodakis and Tziritas (2007 


), and the same is 


true for 


algorithms for probabilistic inference Wainwrighl 


t and Jordan 


(2008 


); 


Roller 



and Friedman (2009). Further, a large number of methods have been pro- 



posed to learn the parameters of random field models Anguelov et al. ( 



Kumar et al. (2005); Sutton and McCallum (2007); Szummer et al.\( 



( 


2005 




2008 



Zhang and Seitz (2005). A number of higher order random field fornm 



ations 



have also been proposed that are able to encode interactions between groups 
of pixels, and have been shown to produce much better results |Kohli et al.\ 
( |2008[ ); Roth and Black (2009). However, despite these rapid developments, 



(most) state-of-the-art random field CRF models continue to suffer from the 
following limitations: (1) they are defined on the basis of a fixed neighbor- 



hood structure (except the work of Kolmogorov and Boykov (2005); Roth 



and Black ( 2007| )), and (2) the potentials are assumed to have a simple para- 



metric form with a pre-specified and fixed dependence on the image data. 
While it is relatively easy to think of various ways to overcome these limita- 
tions, the key research challenge is to suggest a model for which efficient and 
high-quality training is still tractable. 

This paper introduces a new graphical model, the Decision Tree Field 
(DTF), which overcomes the above-mentioned limitations of existing models. 
We take a simple yet radical view: every interaction in our model depends on 
the image, and further, the dependence is non-parametric. It is easy to see 
that even representing such a model is extremely challenging, since there are 
numerous ways of defining a mapping between the image and the parameters 
of a unary or pairwise interaction in the graphical model. 

Our model uses decision trees to map the image content to interaction 
values. Every node of every decision tree in our model is associated with 
a set of parameters, which are used to define the potential functions in the 
graphical model. When making predictions on a novel test instance, the leaf 
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node of the decision tree determines the effective weights. 

There are a number of important reasons for the choice of decision trees 
to specify the dependence between potentials and image content. Firstly, 
decision trees are non-parametric and can represent rich functional relation- 
ships if sufficient training data is available. Secondly, the training of deci- 
sion trees is scalable, both in the training set size and in that the approach 
can be parallelized; recent advances even allows training on graphics pro- 
cessors [Sharp] (120081) . Since for most computer vision applications it is well 



known that the key to obtaining high predictive performance is the amount 
of training data, many recent works use decision trees, or a related variant 



of it (random Forests Breiman (2001), extremely randomized trees Geurts 



et al\ Q2006 ), semantic texton forest Shotton et al ( |2008D ). In our context 
decision trees give another big advantage: they allow us to efficiently and 
jointly learn all parameters in the model. We achieve this by using a log- 
concave pseudo-likelihood objective function, which is known to work well 



given enough training data because it is a consistent estimator Koller and 



Friedman (2009). 



Our Contributions 

(1) To the best of our knowledge, we propose the first graphical model for 
image labeling problems which allows all potential functions to have an ar- 
bitrary dependence on the image data. 

(2) We show how the dependence between potential functions and image data 
can be expressed via decision trees. 

(3) We show how the training of the DTF model, which involves learning of 
a large number of parameters, can be performed efficiently. 

(4) We empirically demonstrate that DTFs are superior to existing models 
such as random forest and common CRFs for applications with complex label 
interactions and large neighborhood structures. 



6.2 Related Work 



There has been relatively little work on learning image-dependent potential 
functions, i.e. the "conditional part" of a random field. Most algorithms for 
learning the parameters of a random field try to learn a class-to-class energy 



table that does not depend on the image content 


Anguelov et al. 


(200[ 


>); 


Ba- 


tra et al ( 


2008 


); 


Nowozin and Lampert 


(2009 


); 


Szummer et al. 


(2008 


); r 


faskar 


et al 


(2005). However, there have been few attempts at learning the param- 



eters of conditional potentials Cho et al (2010); Prasad et al (2006); Gould 



et al (2009). Recently, Gould et al Gould et al (2009) used a multiclass lo- 
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gistic regression classifier on a set of manually selected features, such as the 
length and orientation of region boundaries to obtain an image-dependent 
learned model for pairwise interactions. Even more recently, Cho et al. |Cho| 



et al. (2010) proposed a model for image restoration whose interactions were 



dependent on the semantic meaning of the local image content as predicted 
by a classifier. Unlike our work, all the above-mentioned models either target 
specific tasks, or assume a particular form for the dependence of the poten- 
tials on the image content. Neither of the above-mentioned approaches is able 
to learn a dependency model with thousands or even millions of parameters 
which our model can achieve. 



Decision trees are popularly used to model unary interactions, e.g., Shot- 



ton et al.\ (2011); but with two exceptions they have not been used for pair- 



wise or higher-order interactions. The first exception is the paper of Glesner 
and Koller Glesner and Koller (1995), where decision trees are used to model 
conditional probability tables over many discrete variables in a Bayesian net- 
work. The difference to our work is that our decision trees evaluate the given 
image content, not the state of a random variable, thus requiring no change 
to the inference procedure used. 



The second exception is the "random forest random field" Payet and 



Todorovic (2010). Despite the similarity in name, the approach is funda- 



mentally different from ours. Instead of defining an explicit model as we do 
in (6.2), Payet and Todorovic Payet and Todorovic (2010) define the model 
distribution implicitly as the equilibrium distribution of a learned Metropolis- 
Hastings Markov chain. The Metropolis-Hastings ratio is estimated by clas- 
sification trees. This is a clever idea but comes with a number of limitations, 
i) at test-time there is no choice between different inference methods but one 



is bound to using inefficient Markov Chain Monte Carlo (MCMC); in Payet 



and Todorovic ( |2010[ ) superpixel graphs of few hundred regions are used and 



inference takes 30 seconds despite using advanced Swendsen-Wang cuts, and 
ii) the model remains implicit, such that inspecting the learned interactions 
as we will do in Section [6. 5. 3| is not possible. 

In a broader view, our model has a richer representation of complex label 
structure. Deep architectures, such as Lee et al. (2009) and latent variable 
CRFs, as in Schnitzspan et al. (2010), have the same goal, but use hidden 



variables representing the presence of larger entities such as object parts. 
While these models are successful at representing structure, they are gen- 
erally difficult to train because their negative log-likelihood function is no 
longer convex. In contrast, by learning powerful non-parametric conditional 
interactions we achieve a similar expressive power but retain convexity of the 
training problem. 



6. Learning Discrete Energies 



6.3 Model 

We now describe the details of our model. Throughout we will refer to x e X 

as a given observed image from the set of all possible images X . Our goal 

is to infer a discrete labeling y G 3^, where the labeling is per-pixel, i.e., we 

have y = (yi)iev? V% ^ A where all variables have the same label set C. We 

describe the relationship between x and y by means of an energy function E 

that decomposes into a sum of energy functions E tp over factors F, where F 

defines a subsets of variables. For example, for a pairwise factor it is \F\ = 2. 

We have _^ 

E(y,x,w) = 2^ E tF (y F ,x F ,w tF ). (6.1) 

By y F we denote the collection (yi) ieF , and likewise we write x F to denote 
the parts of x contained in F. While there may be many different subsets in 
J 7 , we assume they are of few distinct types and denote the type of the factor 
F by t F . The function E tp is the same for all factors of that type, but the 
variables and image content it acts upon differs. Furthermore, the function 
is parameterized by means of a weight vector w tp to be discussed below. 



A visualization of a small factor graph model is shown in Figure 6.2 It has 



three pairwise factor types (red, blue, and green) and two unary factor types 



(black and turquoise). All factors depend on the image data x. Figure 6.3 
shows the "unrolled" factor graph for an image of size 4-by-3 pixels, where 
the basic model structure is repeated around each pixel i G V, and pairwise 
factors which reach outside the image range are omitted. In total we have 
\F\ = 31 factors. 



The energy function (6.1) defines a conditional probability distribution 
p(y\x,w) as 

p(y\x, w) = — exp(-E(y, x, w)), (6.2) 

Z{x, w) 

where Z(x,w) = ^ yey exp(—E(y,x,w)) is the normalizing constant. So 
far, our model is in the general form of a conditional random field |Lafferty 



et al. 


(2001) 


in ( 


6.1 


)• 



With each function E t we associate one decision tree. To evaluate E tp (y F ,x F , Wt F ), 
we start at the root of the tree, and perform a sequence of tests s on the 
image content x F) traversing the tree to the left or right. This process is 



illustrated in Figure 6J_ When a leaf node has been reached, we collect the 
path of traversed nodes from the root node to the leaf node. With each node 
q of the tree we associate a table of energy values w tp (q,y F ). Depending on 
the number of variables y F this energy function acts on, the table can be a 
vector (unary), a matrix (pairwise), or general ^-dimensional array (higher 
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order). We sum all the tables along the path taken and compute the energy 
as 

E tF (y F) x F) w tF ) = ^2 w t F (Q,yF), 

where Path(xp) denotes the set of nodes taken during evaluating the tree. By 
using one set of weights at each node we can regularize the nodes at the root 
of the tree to exert a stronger influence, affecting a large number of leaves; 
at test-time we can precompute the summation along each root-to-leaf path 
and store the result at each leaf. 



To compute the overall energy (6.1 ) 
we evaluate E tp for all factors F G 
T . Although the type tp might be 
the same, the function E tp depends 
on xp through the evaluation of the 
decision tree. This allows image- 
dependent unary, pairwise, and higher- 
order interactions. The set T is de- 
termined by repeating the same lo- 
cal neighborhood structure for each 



pixel, as shown in Figures 6.2 and 6.3 




In summary, our model consists 
of a set of factor types. Each fac- 
tor type contains, (i) the number 

k of variables it acts on and their Fig. 6.1: Summation of all energy tables 
relative offsets, (ii) a single decision along the path of visited deci- 

tree, and (hi) for each node in the sion nodes (shaded blue), 

decision tree, a table of energies of size C k . Given a new image for each 
possible labeling y we can evaluate E(y,x,w) by the above procedure. 



6.3.1 Relation to other Models 
The proposed DTF generalizes a number of popular existing image labeling 



methods. If we ignore pairwise and higher-order interactions in (6.1), then 



the variables are independent and making predictions for each pixel is the 



same as evaluating a random forests, as used in e.g. Shotton et al. (2008); 



Tu and Bai (2010). Interestingly, as we will show in the experiments, even 



in this setting we still slightly outperform standard random forests since we 
learn the weights in each decision node instead of using empirical histograms; 
this novel modification improves predictive performance without any test- 
time overhead compared to random forests. For pairwise interactions we 
generalize simple CRFs with contrast-sensitive pairwise potentials such as 
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Fig. 6.2: Neighborhood 

structure around 
each pixel with 
five different factor 
types. 

the popular Grab Cut system Rother et al. ( |2004a[ ) and TextonBoost Shotton 
et al. (2007). Finally, if for the pairwise interactions we use decision trees of 



Fig. 6.3: Unrolled factor graph (im- 
age size 4-by-3 pixels), de- 
pendencies on x and w are 
not shown. 



depth one, such that these interactions do not depend on the image content, 



then our model becomes a classic Markov random field prior Li (1995). 



6.4 Learning Decision Tree Fields 

Learning the model involves selecting the neighborhood structure, the de- 
cision trees, and the weights stored in the decision nodes. During learning 
we are given an iid set {{x m) 2/^)} m =i,...,M of images x m and ground truth 
labelings y^. Our goal is to estimate the parameters w of our model such as 
to predict for a given x m . For simplifying the derivation of the learning 
method, we can treat the given set of images as if it would be one large 



collection of pixels as is done in Sutton and McCallum (2007). 



6.4.1 Maximum Likelihood Learning 

For learning the parameters of our model, we need to elaborate on how the 
parameters w define the energy. One important observation is that for a fixed 



set of decision trees the energy function (6.1) can be represented such that it 



is linear in the parameters w. To see this, consider a single E tp (y F) x F) w tp ) 
function and define a binary indicator function 



B tF (q,z]y F ,x F ) 



1 if n e Path(x j p) and z = y F) 
otherwise. 
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Then, we can write the energy E tF (y F) x F) w tF ) equivalently as a function 
linear in w tp) 

w t F (q,z)B tF (q,z]y F ,x F ). (6.3) 

nGTree(t F ) zey F 

The use of decision trees allows us to represent non-linear functions on x. 



Although non-linear in cc, by the representation (6.3) we can parameterize 
this function linearly in w tp . Then, from (6.3) we see that the gradient has 



a simple form, V WtF ^ z) E tF (y F) x F) w tp ) = B tp (q, z;y F: x F ). 

Because (6.1) is linear in w ) the log-likelihood of (|6.2[ ) is a concave and dif- 
ferentiate function in w (Roller and Friedman, 2009, Corollary 20.2). This 



means that if computing Z(x, w) and the marginal distributions p(y F \x, w) 
for all F e T would be tractable, then learning the parameters by maximum 
likelihood becomes a convex optimization problem. 

We now show how to use efficient approximate likelihood methods to 
learn all parameters associated to the decision trees from training data. For 
now we assume we are given a fixed set of factor types, including decision 
trees, but have to learn the weights/energies associated to the nodes of the 
trees. We will discuss how to learn trees later. 



6.4.2 Pseudolikelihood 



The pseudolikelihood Besag (1977) defines a surrogate likelihood function 
that is maximized. In contrast to the true likelihood function computing 
the pseudolikelihood is tractable and very efficient. The pseudolikelihood is 
derived from the per- variable conditional distributions p(yi\y^^y^ w). By 
defining £i(w) = — log p(yi\y^^ : x, w) we can write the regularized negative 
log-pseudolikelihood £ np i(w) as the average £i over all pixels, 



-y 



i( w ) -^2^ogp t (w t ), 



(6.4) 



where pt(w t ) is a prior distribution over w t used to regularize the weights. We 
will use multivariate Normal distributions A/*(0,a t /), so that — logp t (w t ) is 
of the form ^\\w t \\ 2 + C t (cr t ) and the constant C t ((Jt) can be omitted during 
optimization because it does not depend on w. For each factor type t the 
prior hyperparameter cr t > controls the overall influence of the factor and 
we need to select a suitable value by means of a model selection procedure 
such as cross validation. 

Function (6.4) is convex, different iable, and tractably computable. For 



optimizing (6.4) we use the L-BFGS numerical optimization method Zhu et 



al. (1997). To use L-BFGS we need to iteratively compute £i(w) and the 
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gradient V Wt £i(w). The computation of U{w) and V Wt ii(w) is straightfor- 
ward 



£i(w) = ^2 E F (y* F ,x,w tF ) 



FeM(i) 



+ iog ex P - Ef (y^yv\{i}i x i w t F ) ) (6.5) 

Vi^Pi \ FeM(i) 

V Wt £i{w) = ^2 ^w t E F (j/*, x, w t ) 



FeM t (i) 



" E ( I * 



FeM t (i) 



V Wt E F (yi,yv\ {i} ,x,w t ) 



(6.6) 



where we use M{i) to denote the subset of T that involves variable y^ and 
M t (i) likewise but restricted to factors of matching type, i.e., M t (i) = {F e 



M(i) : t F — t}. By summing (6.5) and (6.6) over all pixels in all images, 
we obtain the objective and its gradient, respectively. When initializing the 
weights to zero we have approximately || V w £ np i(w)\\ w 1. During optimiza- 
tion we stop when || V w £ np i(w)\\ < 10 -4 , which is the case after around 
100-250 L-BFGS iterations, even for models with over a million parameters. 

6.4.3 Learning the Tree Structure 

Ideally, we would like to learn the neighborhood structure and decision trees 
jointly with their weights using a single objective function. However, whereas 
the weights are continuous, the set of decision trees is a large combinatorial 
set. We therefore propose to use a simple two-step heuristic to determine 
the decision tree structure: we learn the classification tree using the train- 
ing samples and the information gain splitting criterion. This greedy tree 
construction is popular and known to work well on image labeling prob- 



lems Shotton et al. (2008). The key parameters are the maximum depth of 
the tree, the minimum number of samples required to keep growing the tree, 
and the type and number of split features used. As these settings differ from 
application to application, we describe them in the experimental section. Un- 
like in a normal classification tree, we store weights at every decision node 
and initialize them to zero, instead of storing histograms over classes at the 
leaf nodes only. 

The above procedure is easily understood for unary interactions, but now 
show that it can be extended in a straightforward manner to learn decision 
trees for pairwise factors as well. To this end, if we have a pairwise factor we 
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consider the product set C x C of labels and treat each label pair (Zi,^) £ 
£ x £ as a single class. Each training pair of labels becomes a single class in 
the product set. Given a set of such training instances we learn a classification 
tree over \C\ 2 classes using the information gain criterion. Instead of storing 
class histograms we now store weight tables with one entry per element in 
C x C The procedure extends to higher-order factors in a straightforward 
way. 

Once the trees are obtained, we set all their weights to zero and opti- 



mize (6.4). During optimization the interaction between different decision 
trees is taken into account. This is important because the tree structures 
are determined independently and if we were to optimize their weight in- 
dependently as well, then we would suffer from overcounting labels during 
training. The same overcounting problem would occur if we would want to 
use the class histograms at the leaf nodes directly, for example by taking the 
negative log-probability as an energy. 

6.4.4 Complexity of Training 



The complexity to compute the overall objective (6.4) and its gradient is 
0(\V\ • \C\ • JV), where V is the set of pixels in the training set, C is the 
label set, and TV is the number of factors in the neighborhood structure. 
Note that this is linear in all quantities, and independent of the order of the 
factors. This is possible only because of the pseudolikelihood approximation. 
Moreover, it is even more efficient than performing a single sweep of message 
passing in loopy belief propagation, which has complexity 0(|V| • \C\ k • N) 
for factors of order k > 2. 

6.4.5 Making Training Efficient 

Training a graphical model on millions of pixels is computationally challeng- 
ing. We have two principled methods to make training efficient. 

First, observe that our training procedure parallelizes in every step: we 



train the classification trees in parallel Sharp (2008). Likewise, evaluat 



ing (6.4) and its gradient is a large summation of independent terms, which 
we again compute in parallel with no communication overhead. 

The second observation is that every step in our training procedure can 
be carried out on a subsampled training set. For classification trees we can 



process a subset of pixels, as in Shotton et al. (2008). Less obvious, we 



can do the same thing when optimizing our objective (6.4). The first term 



in equation (6.4) takes the form of an empirical expectation K ir ^u(v)[^i{ w )] 



that can be approximated both deterministically or by means of stochastic 



6. Learning Discrete Energies 



81 



approximation. We use a deterministic approximation by selecting a fixed 
subset V C V and evaluating t! npl (w) = ^ Z)iev £%{ w ) ~ J2t ^°SPt(^t)- We 
select V to be large enough so this computation remains efficient; typically 
|V'| has a few million elements F] 



6.4.6 Inference 



We use different inference methods during test-time. For making maximum 
posterior marginal predictions (MPM) we use an efficient Gibbs sampler. 
Because the Gibbs sampling updates use the same quantities as used for 



computing (6.5) we do not have to unroll the graph. For obtaining approxi- 



mate MAP predictions, we use the Gibbs sampler with simulated annealing 
(SA), again exploiting the model structure. Both the Gibbs sampler and 
the SA minimization is explained in the supplementary materials. To have 



a baseline comparison, we also minimize (6.1) using tree-reweighted message 



passing (TRW) by unrolling the factor graph and using the implementation 
of Kolmogorov| ( |2006 ). 



6.5 Experiments 

We considered a broad range of applications and report experiments for three 
data sets. One more experiment is reported in the supplementary materials. 
The aim is to show that the DTF enables improved performance in challeng- 
ing tasks, where a large number of interactions and parameters need to be 
considered and these cannot be manually tuned. Moreover, we show that 
conditional pairwise interactions better represent the data and lead to im- 
proved performance. As the three datasets are quite diverse, they also show 
the broad applicability of our system. 



6.5.1 Conditional Interactions: Snake D at aset 



In this experiment we construct a task that has 
only very weak local evidence for any particular la- 
bel and structural information needs to be prop- 
agated at test-time in order to make correct pre- 
dictions. Moreover, this structure is not given but 
needs to be learned from training data. Consider 
Figure 6A_ to the right, illustrating the task. A 
"snake" shown on the input image is a sequence of 
adjacent pixels, and the color in the input image 
encodes the direction of the next pixel: red means "go north" , yellow means 




Fig. 6.4: Input (left), la- 
beling (right). 



2 When sampling V' uniformly at random with replacement from V, the law of large 
numbers guarantees the asymptotic correctness of this approximation. 
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Input Truth Unary MRF DTF 



Unary samples 
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Fig. 6.5: Predictions on a novel test instance. 





RF 


Unary 


MRF 


DTF 


Accuracy 


90.3 


90.9 


91.9 


99.4 


Accuracy (tail) 


100 


100 


100 


100 


Accuracy (mid) 


28 


28 


38 


95 



Tab. 6.1: Test set accuracies for the snake data set. 



"go east" , blue means "go west" , and green means "go south" . Once a back- 
ground pixel is reached, the snake ends. Each snake is ten pixels long, and 
each pixel is assigned its own label, starting from the head (black) to the 



tail (white). Knowing about these rules, the labeling (Figure 6.4[ right) can 
be perfectly reconstructed. Here, however, these rules need to be learned 
from training instances. Of course, in a real system the unary interactions 



Shotton et al. 


(2007 


); 


Batra et al. 


(2008) 



we believe that the task distills the limitations of noisy unary interactions: 
in this task, for making perfect predictions, the unary would need to learn 
about all possible snakes of length ten, of which there are clearly too manyj^] 
We use a standard 4-neighborhood for both the MRF and the DTF mod- 
els. The unary decision trees are allowed to look at every pixel in the input 
image, and therefore could remember the entire training image. For exper- 
imental details, please see the supplementary materials. We use a training 
set of 200 images, and a test set of 100 images. 



The results obtained are shown in Table [6T] and Figure [63} Here random 
forests (RF), trained unary potentials (Unary), and the learned Markov ran- 
dom field (MRF) perform equally well, at around 91%. Upon examining this 
performance further, we discovered that while the head and tail labels are 
labeled with perfect accuracy, towards the middle segments of the snakes the 
labeling error is highest, see Table |rTT| This is plausible, as for these labels 



3 In general, the number of snakes is equal to the number of fixed-length self-avoiding 
walks on a lattice, a number which is conjectured to be exponential in the length. 
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Fig. 6.6: Test set predictions for the large occlusion case. 



the local evidence is weakest. When using conditional pairwise interactions 
the performance improves to an almost perfect 99.4%. This again makes 
sense because the pairwise conditional interactions are allowed to peek at 
the color-codes at their neighbors for determining the directionality of the 
snake. 



The predictions are illustrated for a single test instance in Figure [63} We 
see that only the DTF makes a perfect prediction. To show the uncertainty 
of the unary model, we visualize two samples from the model. 

6.5.2 Learning Calligraphy: Chinese Characters 

In the previous experiment we have used a standard 4-connected neigh- 
borhood structure. In this experiment we show that by using larger con- 
ditional neighborhoods we are able to represent shape. We use the KAIST 
Hanja2 database of handwritten Chinese characters. We occlude each char- 
acter by grey box centered on the image, but with random width and height. 
For more details, please see the supplementary materials. This is shown in 
the leftmost column of Figure |6.6| We consider two datasets, one where 
we have a "small occlusion" and one with a "large occlusion" box. Note 
that most characters in the test set have never been observed in the train- 
ing set, but a model that has learned about shape structure of Chinese 
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characters can still find plausible completions of the input image. To this 
end we use one unary factor with a decision tree of depth 15. Addition- 
ally, we use a dense pairwise neighborhood structure of 8-connected neigh- 
bors at one and two pixels distance, plus a sparse set of 27 neighbors at 
{(-9, 0), (-9, 3), (-9, 6), (-9, 9), (-6, 0), . . . , (9, 9)}. Therefore, each vari- 
able has 2 • (24 + 4 + 4) = 64 neighboring variables in the model. For the 
pairwise decision trees we use trees of depth one (MRF), or six (DTF). 
The results for the large occlu- 



sion task are shown in Figure 6.6 



Qualitatively, they show the differ- 
ence between a rich connectivity struc 
ture and conditional interactions. Ob- 
serve, for example, that the MRF 
essentially performs only a smooth- 
ing of the results while respecting lo- 
cal stroke- width constraints, as ap- 
parent from the MRF MAP predic- 

' In 



tion in the first row of Figure [6^6| 
contrast, the DTF predictions hal- 
lucinate meaningful structure that 

may be quite different from the ground Fig. 6.7: Pairwise associativity strength, 
truth but bears similarity to Chi- Please see text, 

nese characters. Note that we achieve this rich structure without the use 
of any latent variables. Because this task is an inpainting task, the quan- 
titative assessment is more difficult since the task is truly ambiguous. We 
therefore report accuracies only for the small-occlusion case, where a reason- 
able reconstruction of the ground truth seems more feasible. We measure the 
per-pixel accuracy in the occluded area on the test set. For the random for- 
est baseline we obtain 67.74%. The MRF with dense neighborhood improves 
this to 75.18% and the DTF obtains 76.01%. 

As an example of what structure the model is able to learn, consider 
the visualization of the MRF pairwise interactions shown in Figure |6.7[ The 



figure shows for each pairwise interaction the sum of learned diagonal energies 
minus the sum of cross-diagonal entries. If this value is negative (shown in 
blue) the interaction is encouraging the pixels to take the same value. Red 
marks interactions that encourage pixels to take different values. The plot 
shows that there is a strong local smoothing term, but interestingly it is 
not symmetric. This can be explained by the fact that horizontal strokes in 
Chinese characters are typically slanted slightly upwards Chen (2011). Note 
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Input / Truth 



unary 



+1,5,20 




Fig. 6.8: Test recognition results. MRF (top) and DTF (bottom). 



that we discovered these regularities automatically from the training data. 
6.5.3 Accurate body-part detection 

We consider the task of body part classification from depth images, as re- 



cently proposed in photton et al] fl201ip . Given a 2D depth image, and a 
foreground mask, the task is to label each pixel as belonging to one of 31 
different body parts, as shown in Figure [6~8t Despite the variations in pose 



and body sizes Shotton et al. (2011) obtains high-quality recognition results 



by evaluating a random forest for each pixel, testing local and global depth 
disparities. In this task, the label set has a large amount of structure, but 
it is not clear that a sufficiently complex unary classifier, when given the 
image, cannot implicitly represent this structure reasonably well. Here we 
show that by adding pairwise interactions we in fact improve the recognition 
accuracy. Moreover, once we make the interactions conditional, accuracy 
improves even further. 

The experimental setup is as follows. We use a subset of the data used 
in Shotton et al. (2011): 30 depth images for training, and 150 images for 



testing. We train 4 unary decision trees for all models. For the pairwise 
models, we use the following neighborhood sizes, (i) "+1" for adding a 4- 
neighborhood one pixel away, (ii) "+5" for an 8-neighborhood five pixels 
away, and (iii) "+20" when adding an 8-neighborhood twenty pixels away. 
In the "+1,5,20" configuration, each variable has 4+8+8=20 neighbors. For 
each of the pairwise interactions we train two trees of depth six. A more 
detailed description of the exact experimental setup can be found in the 
supplementary materials. We measure the results using the same mean per- 
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class accuracy score as used in Shotton et al. (2011). 

The results for 30 training images are shown in Table |6.2| and one in- 
stance is shown in Figure [6^8} Even without adding pairwise interactions, our 
learned unary weights already outperform the random forest classifier |Shot-| 
ton et al\ (120111). When adding more interactions (+1, +1,20, +1,5,20), 



the performance increases because dense pairwise interactions can represent 
implicit size preferences for the body parts. Likewise, when adding condi- 
tionally (MRF to DTF), the performance improves. The best performing 
model is our DTF with large structure (+1,5,20) and almost 1.5 million free 
parameters. It is trained in only 22 minutes and achieves 27.35% mean per- 



class accuracy. For the same setup of 30 images, the original work Shotton 



et al. (2011) reports a mean per class accuracy of 14.8%, while achieving an 
impressive 56.6% when scaling to 900k training images, trained for one day 
on a 1000 core cluster. 

An example of a learned pairwise interaction is shown in Figure |6.9| 
demonstrating that the improved performance of the DTF can be directly 
attributed to the more powerful interactions that are allowed to take the 
image into account. We report more results in the supplementary materials. 



Model 


Measure 


IShotton et al. 


( 


2011 


) 


unary 


+1 


+1,20 


+1,5,20 


MRF 


avg-acc 
runtime 
weights 






14.8 
lm 


21.36 
3ml8 
176k 


21.96 
3m38 
178k 


23.64 
10m 
183k 


24.05 
10m 
187k 


DTF 


avg-acc 
runtime 
weights 












23.71 
5ml6 
438k 


25.72 
17m 
951k 


27.35 

22m 
1.47M 



Tab. 6.2: Body-part recognition results: mean per-class accuracy, training time on 
a single 8-core machine, and number of model parameters. 



6.6 Conclusion 

We have introduced Decision Tree Fields as flexible and accurate models for 
image labeling tasks. This accuracy is achieved by being able to represent 
complex image-dependent structure between labels. Most importantly, this 
expressiveness is achieved without the use of latent variables and therefore 
we can learn the parameters of our model efficiently by minimizing a convex 
function. 



6. Learning Discrete Energies 



87 



Fig. 6.9: Learned horizontal interactions: The left figure shows the aver- 
age depth-normalized silhouette reaching one of the 32 leaf nodes in the 
learned decision tree. We select one leaf (marked red, enlarged) and show 
the corresponding effective 32 x 32 weight matrix obtained by summing 
the learned weights along the path from the root to leaf node. The con- 
ditional interaction can be understood by visualizing the most attractive 
(blue) and most repulsive (red) elements in the matrix. We superim- 
pose arrows for the two most attractive and repulsive interactions on test 
images (right). The first and second pose exemplify how left and right 
upper parts of the legs can appear 20 pix to the right of each other in 
a way that matches the pattern of the leaf. While a configuration like 
shown in the third and fourth pose is plausible, it does not fit the leaf 
pattern and thus the interaction is not active. 
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In the previous part several different applications in the domain of com- 
puter vision were presented to demonstrate the enhanced descriptive power 
gained by considering arbitrary energies. However, this gain comes with a 
price tag: existing optimization algorithms no longer provide good approx- 
imations in practice. This part of my work addresses this issue. It con- 
centrates around the second axis of this thesis, which focuses on practical 
methods and approaches for approximating the minimization of challenging 
arbitrary discrete energies. 

In particular, Chapter [7] proposes a discrete optimization approach to 
the correlation clustering functional presented in Chapters [3] and [4j This 
approach scales gracefully with the number of variables, better than ex- 
isting approaches (Vitaladevuni and Basri ( 2010| )). In fact, we show that 
our discrete approach to the CC optimization can handle energies defined 
over hundreds of thousands of variables, as arise in e.g., image segmentation 
(Sec. 4.3). This is by far more variables than any other existing method can 
handle. 

Chapter [8] concludes this part with a more general perspective on discrete 
optimization. This new perspective is inspired by multiscale approaches and 
suggests to cope with the NP-hardness of discrete optimization using the 
multiscale landscape of the energy function. Defining and observing this 
multiscale landscape of the energy, I propose methods to explore and exploit 
it to derive coarse-to-fine optimization framework. This new perspective gives 
rise to a unified multiscale framework for discrete optimization. The proposed 
multiscale approach is applicable to a diversity of discrete energies, both 
smoothness-encouraging as well as arbitrary, contrast-enhancing functions. 



7. CORRELATION CLUSTERING OPTIMIZATION^ 



The focus of this chapter is the optimization of the Correlation Clustering 
functional which combines positive and negative affinities between the data 
points. The main contribution of this chapter are new optimization algo- 
rithms which can cope with large scale problems (> 100 K variables) that are 
infeasible using existing methods. We draw an analogy between optimizing 
this functional and the well known Potts energy minimization. This anal- 
ogy allows us to suggest several new optimization algorithms, which exploit 
the intrinsic "model-selection" capability of the functional to automatically 
recover the underlying number of clusters. We compare our algorithms to 
existing methods on both synthetic and real data. 



7.1 Introduction 



Optimizing CC is tightly related to many graph partitioning formulations 



(Nowozin and Jegelka (2009)), however it is known to be NP-hard (Bansal 
et al. ( 2004[ )). Existing methods derive convex continuous relaxations to 
approximately optimize the CC functional. However, these algorithms do 
not scale beyond a few thousands of variables. See for example, the works 



(2010); Glasner et al. (2011). 



of Nowozin and Jegelka (2009); Bagon et al. (2010); Vitaladevuni and Basri 



This work suggests a new perspective on the CC functional, showing its 
analogy to the known Potts model. This new perspective allows us to leverage 
on recent advances in discrete optimization to propose new CC optimization 
algorithms. We show that our algorithms scale to large number of variables 
(> 100X), and in fact can tackle tasks that were infeasible in the past, 
e.g., applying CC to pixel-level image segmentation. In addition, we pro- 
vide a rigorous statistical interpretation for the CC functional and justify its 
intrinsic model selection capability. Our algorithms exploit this "model se- 
lection" property to automatically recover the underlying number of clusters 
k. 



1 This is joint work with Meirav Galun 
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7.2 CC Optimization: Continuous Perspective 



Optimizing the correlation clustering functional (Eq. (|4.1[)) is NP-hard (Bansal 



et al. ( 2004[ )). Instead of solving directly for a partition U ) existing meth 



ods optimize indirectly for the binary adjacency matrix X = UU T , i.e. 
Xij — 1 iff i and j belong to the same cluster. By introducing the binary 
adjacency matrix the quadratic objective (w.r.t. U): — Wij \UU T ^.. be- 
comes linear (w.r.t. X)\ — ^ijWijXij. The connected components of X, 
after proper rounding, are the resulting clusters, and the number of clusters 
k naturally emerges. Indirect optimization methods must ascertain that the 
feasible set consists only of "decomposable" X: X = UU T . This may be 



achieved either by posing semi-definite constraints on X (Vitaladevuni and 



Basri (2010)), or by introducing large number of linear inequalities (Demaine 



and Immorlica (2003); Vitaladevuni and Basri ( 2010| )). These methods take 



a continuous and convex relaxation approach to approximate the resulting 
functional. This approach allows for nice theoretical properties due to the 
convex optimization at the cost of a very restricted scalability. 

Solving for X requires ~ n 2 variables instead of only ~ n when solving 
directly for U. Therefore, these methods scale poorly with the number of 
variables n, and in fact, they cannot handle more than a few hundreds of 
variables. In summary, these methods suffer from two drawbacks: (i) recov- 
ering U from X is highly susceptible to noise and more importantly (ii) it is 
infeasible to solve large scale problems by these methods. 



7.3 Our New Perspective on CC 

Existing methods view the CC optimization in the context of convex relax- 
ation and build upon methods and approaches that are common practice in 
this field of continuous optimization. We propose an alternative perspective 
to the CC optimization: viewing it as a discrete energy minimization. This 
new perspective allows us to build upon recent advances in discrete opti- 
mization and propose efficient and direct CC optimization algorithms. More 
importantly, the resulting algorithms solve directly for J7, and thus scales 
significantly better with the number of variables. 



We now show how to cast the CC functional of Eq.(4.1 ) as a discrete pair- 
wise conditional random field (CRF) energy. For ease of notation, we describe 
a partition U using a labeling vector L 6 {1,2.. .} n : li — c iff Ui C — 1. A 
general form of pair- wise CRF energy is E (L) = ^% (k) + Eij (Z$, lj) 



(Boykov et al (2001)). Discarding the unary term and taking 



the pair-wise term to be Wij if k ^ lj we can re-write the CC functional as 
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a CRF energy: 



Sec (L) = W v 



■ 1 



(7.1) 



i>3 



This is a Potts model. Optimizing the CC functional can now be interpreted 
as searching for a MAP assignment for the energy (7.1). 

The resulting Potts energy has three unique characteristics, each posing 
a challenge to the optimization process: 

(i) Non-submodular: The energy is non-submodular. The notion of sub- 
modularity is the discrete analogue of convexity from continuous optimization 
(Lovasz ( 1983[ )). Optimizing a non-submodular energy is NP-hard, even for 
the binary case dRother et Z] ( [2007| )). 

(ii) Unknown number of labels: Most CRF energies are defined for a 
fixed and known number of labels. Thus, the search space is restricted to 
L G {1, . . . , k} n . When the number of labels k is unknown the search space 
is by far larger and more complicated. 

(iii) No unary term: There is no unary term in the energy. The unary 
term plays an important role in guiding the optimization process QSzeliski| 



et al (2008)). Moreover, a strong unary term is crucial when the energy in 



non-submodular (Rother et al ( 2007| )). 

There exist examples of CRFs in the literature that share some of these 



characteristics (e.g., non-submodular Rother et al (2007); Kolmogorov and 



Wainwright (2005), unknown number of labels Isack and Boykov (2011); 
Bleyer et al ( |2010| )) . Yet, to the best of our knowledge, no existing CRF 
exhibits all these three challenges at once. More specifically, we are the 
first to handle non-submodular energy that has no unary term. Therefore, 
we cannot just use "off-the-shelf Potts optimization algorithms, but rather 
modify and improve them to cope with the three challenges posed by the CC 
energy. 



7.4 Our Large Scale CC Optimization 



In this section we adapt known discrete energy minimization algorithms to 
cope with the three challenges posed by the CC energy. We derive three CC 
optimization algorithms that stem from either large move making algorithms 
(a-expand and a/5-swap of Boykov et al Q2001P ), or Iterated Conditional 
Modes (ICM) of Besag (1986). Our resulting algorithms scale gracefully 



with the number of variables n, and solve CC optimization problems that 
were infeasible in the past. Furthermore, our algorithms take advantage of 



the intrinsic model selection capability of the CC functional (Sec. 4.2) to 
robustly recover the underlying number of clusters. 
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Algorithm 1: Expand- and- Explore 
Input: Affinity matrix W eR nXn 
Output: Labeling vector LG {1,2,.. .} n 

Init Li <— 1, i — 1, . . . , n II initial labeling 

repeat 

for a <- 1 ; a < #L + 1 ; a + + do 
|^ L <— Expand (a) 

until L is unchanged 

j^L denotes the number of different labels in L. 
Expand (a): expanding a using QPBOI. 

By letting a = #L + 1 the algorithm "expand" and explore an empty 
label. This may affect the number of labels #L. 



7.4.1 Improved large move making algorithms 
Boykov et al. (2001|) introduced a very effective method for multi-label energy 



minimization that makes large search steps by iteratively solving binary sub- 
problems. There are two large move making algorithms: a-expand and a/3- 
swap that differ by the binary sub-problem they solve, a-expand consider for 
each variable whether it is better to retain its current label or flip it to label 
a. The binary step of a/3-swap involves only variables that are currently 
assigned to labels a or /3, and consider whether it is better to retain their 
current label or switch to either a or /3. Defined for submodular energies, 
the binary step in these algorithms is solved using graph-cut. 

We propose new optimization algorithms: Expand- and- Explore and Swap- 
and-Explore, that can cope with the challenges of the CC energy, (i) For the 
binary step we use a solver that handles non-submodular energies, (ii) We 
incorporate "model selection" into the iterative search to recover the under- 
lying number of clusters k. (iii) In the absence of unary term, a good initial 
labeling is provided to the non-submodular binary solver. 

Binary non-submodular energies can be approximated by an extension of 
graph-cuts: QPBO (Rother et al. ( 2007| )). When the binary energy is non- 



submodular QPBO is not guaranteed to provide a labeling for all variables. 
Instead, it outputs only a partial labeling. How many variables are labeled 
depends on the amount of non-submodular pairs and the relative strength 
of the unary term for the specific energy. When no unary term exists in the 
energy QPBO leaves most of the variables unlabeled. To circumvent this 
behavior we use the "improve" extension of QPBO (denoted by QPBOI): 
This extension is capable of improving an initial labeling to find a labeling 
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Algorithm 2: Swap-and-Explore 
Input: Affinity matrix W eR nXn 
Output: Labeling vector LG {1,2,.. .} n 

Init Li <— 1, i — 1, . . . , n II initial labeling 

repeat 

for a <— 1 ; a < #L ; a + + do 

for 13 <- a ; /3 < #L + 1 ; /3 + + do 
\_ L ^r- Swap(a ; (3) 

until L is unchanged 

#L denotes the number of different labels in L. 

Swap(a ; f3)\ swapping labels a and f3 using QPBOI. 

By letting f3 = #L + 1 the algorithm explore new number of clusters, 

this may affect the number of labels #L. 



with lower energy ( |Rother et al ( 2007[ )). In the context of expand and swap 



algorithms a natural initial labeling for the binary steps is to use the current 
labels of the variables and use QPBOI to improve on it, ensuring the energy 
does not increase during iterations. 

To overcome the problem of finding the number of clusters k our algo- 
rithms do not iterate over a fixed number of labels, but explore an "empty" 
cluster in addition to the existing clusters in the current solution. Exploring 
an extra empty cluster allows the algorithms to optimize over all solutions 
with any number of clusters k. The fact that there is no unary term in the 
energy makes it straight forward to perform. Alg. [T] and Alg. [2] presents our 
Expand- and- Explore and Swap-and-Explore algorithms in more detail. 

7.4.2 Adaptive-label ICM 

Another discrete energy minimization method that we modified to cope with 
the three challenges of the CC optimization is ICM (Besag ( 1986[ )). It is a 



point- wise greedy search algorithm. Iteratively, each variable is assigned the 
label that minimizes the energy, conditioned on the current labels of all the 
other variables. ICM is commonly used for MAP estimation of energies with 
a fixed number of labels. Here we present an adaptive-label ICM: using the 
ICM conditional iterations we adaptively determine the number of labels k. 
Conditioned on the current labeling, we assign each point to the cluster it is 
most attracted to, or to a singleton cluster if it is repelled by all. 
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In this section we proposed a new perspective on CC optimization. Inter- 
preting it as MAP estimation of Potts energy allows us to propose a variety 
of efficient optimization method^} 

• Swap-and-Explore (with binary step using QPBOI) 

• Expand-and-Explore (with binary step using QPBOI) 

• Adaptive-label ICM 

Our proposed approach has the following advantages: 
(i) It solves only for n integer variables. This is by far less than the number 



of variables required by existing methods described in Sec. 7.2, which require 



~ n variables of the adjacency matrix X = UU . It makes our approach 
capable of dealing with large number of variables (> 100X) and suitable for 
pixel-level image segmentation. 

(ii) The algorithms solve directly for the cluster membership of each point, 
thus there is no need for rounding scheme to extract U from the adjacency 
matrix X. 

(iii) The number of clusters k is optimally determined by the algorithm 
and it does not have to be externally supplied like in many other cluster- 
ing/segmentation methods. 



In their work Eisner and Schudy (2009) proposed a greedy algorithm 



to optimize the CC functional over complete graphs. Their algorithm is in 
fact an ICM method presented outside the proper context of CRF energy 
minimization, and thus does not allow to generalize the concept of discrete 
optimization to more recent optimization methods. 



7.5 Experimental Results 

This section evaluates the performance of our proposed optimization algo- 
rithms using both synthetic and real data. We compare to both existing dis- 
crete optimization algorithms that can handle multi-label non-submodular 



energies (TRW-S of Kolmogorov (2006) and BP of |Pearl (1988) 3 ), and to 



existing state-of-the-art CC optimization method of Vitaladevuni and Basri 



(2010). Since existing CC optimization methods do not scale beyond several 



hundreds of variables, extremely small matrices are used in the following ex- 
periments. Interactive segmentation results (Sec. 4.3) evaluated our method 
on large scale problems. 



2 Matlab implementation available at: |http : //www . wisdom . w eizmann . ac . il/~bagon/| 
mat la b. html [ 

* Since these two algorithms work only with pre-defined number of clusters fc, we over- 
estimate k and report only the number of non empty clusters in the solution. 
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(a) Energy (lower=better) (b) Recovered k (GT in dashed) 




sparsity [%] sparsity [%] 

Legend:Swap-and-Explore, Expand-and-Explore, ICM, 1 RW-S, BP 

Fig. 7.1: Synthetic results: Graphs comparing (a) Energy at convergence. 

(b) Recovered number of clusters, (c) Purity of resulting clusters, (d) Run 
time of algorithms (log scale). TR\ and BP are almost indistinguish- 
able, as are Swap and Expand in most of the plots. 

7.5.1 Synthetic data 

This experiment uses synthetic affinity matrices W to compare our algo- 
rithms to existing Potts optimization algorithms. The synthetic data have 
750 variables randomly assigned to 15 clusters with different sizes (ratio be- 
tween larger to smaller cluster: ~ x5). For each variable we sampled roughly 
the same number of neighbors: of which ~ 25% are from within the cluster 
and the rest from the other clusters. We corrupted the clean ground-truth 
adjacency matrix with 20% noise affecting both the sign of and the cer- 
tainty (i.e., |Wij|). Overall the resulting percent of positive (submodular) 
connections is ~ 30%. 

We report several measurements for these experiments: run-time, energy 
(^cc)? purity of the resulting clusters and the recovered number of clusters 
k for each of the algorithms as a function of the sparsity of the matrix W , 
i.e., percent of non-zero entries. Each experiment was repeated 10 times with 
different randomly generated matrices. 

Fig. |7.1| shows results of the synthetic experiments. Existing multi-label 
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approaches (TRW-S and BP) do not perform too well: higher Ecc-> lower pu- 
rity and incorrect recovery of k. This demonstrates the difficulty of the energy 
minimization problem that has no unary term and many non-submodular 
pair- wise terms. These results are in accordance with the observations of 



Kolmogorov and Wainwright (2005) when the energy is hard to optimize. 

For our large move making algorithms, Expand-and- Explore provides 
marginally better clustering results than the Swap-and-Explore. However, 
its relatively slow running time makes it infeasible for large CC problem^} 
A somewhat surprising result of these experiments shows that for matrices 
not too sparse (above 10%), adaptive-label ICM performs surprisingly well. 
In fact, it is significantly faster than all the other methods and manages to 
converge to the correct number of clusters with high purity and low energy. 

From these experiments we conclude that Swap-and-Explore (Alg. [2]) is a 
very good choice of optimization algorithm for the CC functional. However, 
when the affinity matrix W is not too sparse, it is worth while giving our 
adaptive-label ICM a shot. 

7.5.2 Co-clustering data 
The following experiment compares our algorithms with a state-of-the-art 



CC optimization method, R-LP, of Vitaladevuni and Basri (2010). For this 



comparison we use affinity matrices computed for co-segmentation. The co- 
segmentation problem can be formulated as a correlation clustering problem 



with super pixels as the variables (Glasner et al. ( 2011|)). 



We obtained 77 affinity matrices, courtesy of Glasner et al. (2011), used 
in their experiments. The number of variables in each matrix ranges from 
87 to 788, Their sparsity (percent of non-zero entries) ranges from 6% to 
50%, and there are roughly the same number of positive (submodular) and 
negative (non-submodular) entries. 



Table |7.1| shows the ratio between our energy and the energy of R-LP 
method. The table also shows the percent of matrices for which our algo- 
rithms found a solution with lower energy than R-LP. The results show the 
superiority of our algorithms to existing multi-label energy minimization ap- 
proaches (TRW-S and BP). Furthermore, it is shown that our methods are 
in par with existing state-of-the-art CC optimization method on small prob- 
lems. However, unlike existing methods, our algorithms can be applied to 
problems two orders of magnitude larger. Optimizing directly for U not only 



4 This difference in run time between Expand and Swap can be explained by looking at 
the number of variables involved in each of the binary steps carried out: For the expand 
algorithm, each binary step involves almost all the variables, while the binary swap move 
considers only a small subset of the variables. 
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Ours 










Swap 


Expand 


ICM 


TRWS 


BP 


Energy ratio 


98.6 


98.4 


77.4 


83.8 


83.6 


(%) 


±1.4 


±1.9 


±23.9 


±5.4 


±6.3 


Strictly lower 
(> 100%) 


15% 


11.7% 












Tab. 7.1: Comparison to Glasner et al. (2011): Ratio between our energy 



and of Glasner et al.; Since all energies are negative, higher ratio means 
lower energy. Ratio higher than 100% means our energy is better than 
Glasner et al.. Bottom row shows the percentage of cases where each 
method got strictly lower energy than Glasner et al.. 



did not compromise the performance of our method, but also allows us to 
handle large scale CC optimization, as demonstrated in the next section. 



7.6 Conclusion 



This work suggests a new perspective on the functional, viewing it as a dis- 
crete Potts energy. The resulting energy minimization presents three chal- 
lenges: (i) the energy is non submodular, (ii) the number of clusters is not 
known in advance, and (iii) there is no unary term. We proposed new energy 
minimization algorithms that can successfully cope with these challenges. 



8. DISCRETE MULTISCALE OPTIMIZATION^ 



This chapter presents a unified multiscale framework for discrete energy min- 
imization that directly acts on the energy. Our approach utilizes algebraic 
multiscale principles to efficiently explore the discrete solution space. The 
main goal of our framework is to improve optimization performance for chal- 
lenging, non-submodular energies for which current methods provide unsat- 
isfactory approximations. Furthermore, the ability to derive a multiscale 
pyramid directly from an energy makes our framework application indepen- 
dent. Two important implications rise from this independence: (i) One no 
longer needs to tailor a multiscale scheme to suit each different application, 
(ii) Our framework makes it trivial to turn existing single scale optimization 
algorithms into powerful multiscale methods. Our framework gives rise to 
two complementary energy coarsening strategies: one in which coarser scales 
involve fewer variables, and a more revolutionary one in which the coarser 
scales involve fewer discrete labels. We empirically evaluated our unified 
framework on a variety of both non-submodular and submodular energies, 
including energies from the Middlebury benchmark. 



8.1 Introduction 



Discrete energy minimization is ubiquitous in computer vision, and spans a 
variety of problems such as segmentation, denoising, stereo, etc. Unfortu- 
nately, apart from the submodular binary case, minimizing these energies 
is known to be NP-hard. A lot of effort is recently put on developing al- 
gorithms for approximate discrete optimization for ever more challenging 



energies: multi-label, non-submodular, etc. (e.g., Szeliski et al (2008); Kol- 



mogorov 



fl2006| ); |Bagon and Galun| ( [2011] )). 
Discrete energies may be grossly divided into two categories: submodular 
(regular) energies and non-submodular energies. Submodular energies are 
characterized by smoothness-encouraging pair- wise (and higher order) terms. 
These energies reflect the "piece- wise constant" prior that is very popular and 



1 This is joint work with Meirav Galun. It was published in the b th NIPS workshop on 
optimization for machine learning, |2Q12[ 
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Fig. 8.1: A Unified multiscale framework: We derive multiscale representa- 
tion of the energy itself = energy pyramid. Our multiscale framework is 
unified in the sense that different problems with different energies share 
the same multiscale scheme, making our framework widely applicable and 
general. 



common in computer vision applications. For this reason most of the effort 
and research regarding discrete optimization, in the context of computer 
vision, focuses on these energies with encouraging results. In practice, despite 
the NP-hardness of these energies, algorithms were developed that provide 
solutions with energies close to global optimum (e.g., |Kolmogorov (2006); 
Boykov et al. ( 2001| )). Therefore we consider this type of energies as "easy 



to optimize" . 

In contrast, non-submodular energies are characterized by contrast-encouraging 
pair wise terms. These energies may be encountered when the parameters 
of the energy are learned (e.g., Nowozin et al. ( 2011[)), or when different 
functionals are used (e.g., |Bagon and Galun (201lf~ [Glasner et al. ( 201 1| )) . 
When it comes to optimization it is generally considered a more challenging 
task to optimize a non-submodular energies. Since these examples of non- 
submodular energies are only recently explored, their optimization receives 
less attention, and consequently, the existing optimization methods provide 
approximations that may be quite unsatisfactory. We consider these energies 
as "hard to optimize" . 

Algorithms for discrete energy minimization may also be classified into 
two categories: primal methods and dual methods. Primal methods act di- 
rectly on the discrete variables in the label space to minimize the energy 
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Multiscale landscape 



mulate a dual problem to the energy and maximize a lower bound to the 
sought energy (e.g., Kolmogorov ( 2006| )). Dual methods are recently con- 
sidered more favorable since they not only provide an approximate solution, 
but also provide a lower bound on how far this solution is from the global 
optimum. Furthermore, if a labeling is found with energy equals to the lower 
bound a certificate is provided that the global optimum was found. 

Since most of the relevant discrete optimization problems are NP-hard, 
one can only provide an empirical evaluation of how well a given algorithm 
approximates representative instances of these energies. For the submodular, 
"easy to optimize", energies it was shown (by Szeliski et al. ( 2008[ )) that dual 
methods tend to provide better approximations with very tight lower bounds. 

But what makes discrete energy min- 
imization such a challenging endeavor? 
The fact that this minimization implies 
an exploration of an exponentially large 
search space makes it such a challenging 
task. One way to alleviate this difficulty 
is to use multiscale search. The illustra- 
tion on the right shows a toy "energy" 
E(L) at different scales of detail. Con- 
sidering only the original scale (s = 0), it is very difficult to suggest an ef- 
fective exploration/optimization method. However, when looking at coarser 
scales (s = 1, ... ,3) of the energy an interesting phenomenon is revealed. 
At the coarsest scale (s = 3) the large basin of attraction emerges, but with 
very low accuracy. As the scales become finer (s = 2, . . . , 0), one "loses sight" 
of the large basin, but may now "sense" more local properties with higher 
accuracy. We term this well known phenomenon as the multiscale landscape 
of the energy. This multiscale landscape phenomenon encourages coarse-to- 
fine exploration strategies: starting with the large basins that are apparent 
at coarse scales, and then gradually and locally refining the search at finer 
scales. 

For more than three decades the vision community has focused on the 




s-O (original) 



multiscale pyramid of the image (e.g., Lucas and Kanade (1981); Burt and 
Adelson (1983)). There is almost no experience and no methods that apply 
a multiscale scheme directly to the discrete energy. 

In this paper we present a novel unified discrete multiscale optimization 



scheme that acts directly on the energy (Fig. 8.1). Our approach allows for an 
efficient exploration of the discrete solution space through the construction of 
an energy pyramid. Moreover, our multiscale framework is application inde- 
pendent: different problems with different energies share the same multiscale 
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scheme, making our framework widely applicable and general. 

Performing empirical evaluations of non-submodular energies minimiza- 
tion lead us to conclude that when it comes to hard to optimize non-submodular 
energies, primal methods tend to provide better approximations than dual 
methods. Motivated by this observation, we formulate out multiscale frame- 
work in the primal space (i.e., expressing it in terms of the variables and 
labels directly). Our multiscale framework becomes the core of the opti- 
mization process allowing for existing "off-the-shelf primal optimization al- 
gorithms to efficiently exploit the multiscale landscape of the energy and 
achieves significantly lower energies faster. 

This work makes several contributions: 

(i) A novel unified multiscale framework for discrete optimization. A wide 
variety of optimization problems, including segmentation, stereo, de- 
noising, correlation-clustering, and others share the same multiscale 
framework. 

(ii) Energy-aware coarsening scheme. Variable aggregation takes into ac- 
count the underlying structure of the energy itself, thus efficiently and 
directly exposes its multiscale landscape. 

(iii) Coarsening the labels. Our formulation allows for variable aggregation 
as well as for label coarsening. This yields an energy pyramid with 
fewer labels at the coarser scales. 

(iv) Integrating existing single-scale optimization algorithms into our multi- 
scale framework. We achieve significantly lower energy assignments on 
diverse computer vision energies, including challenging non-submodular 
examples. 

(v) Optimizing hard non-submodular energies. Using several classes of 
non-submodular energies, we empirically exemplify the superiority of 
primal methods. We further show how combining in our multiscale 
framework single-scale primal optimization methods achieve increased 
optimization performance on these challenging problems. 



8.1.1 Related work 



There are very few works that apply multiscale schemes directly to the en- 
ergy. A prominent example for this approach is that of |Felzenszwalb and 



Huttenlocher (2006), that provide a coarse-to-fine belief propagation scheme 



restricted to regular diadic pyramid. A more recent work is that of Ko- 



modakis (2010) that provides an algebraic multigrid formulation for discrete 
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optimization in the dual space. However, despite his general formulation |Ko- 
modakis| only provides examples using regular diadic grids of easy to optimize 
submodular energies. 



The work of Kim et al. (2011) proposes a two-scales scheme mainly aimed 



at improving run-time of the optimization process. Their proposed coarsen- 
ing strategies can be interpreted as special cases of our unified framework. 



We analyze their underlying assumptions (Sec. 8.3.1), and suggest better 



methods for efficient exploration of the multiscale landscape of the energy. 
A different approach for discrete optimization suggests large move mak- 



ing algorithms (e.g., Boykov et al. (2001); Swendsen and Wang (1987)). We 
experimentally show how plugging such methods within our multiscale frame- 
work improves optimization results. These methods do not scale gracefully 



with the number of labels. Lempitsky et al. (2007) proposed a method to 



exploit known properties of the metric between the labels to allow for faster 
minimization of the energy. However, their method is restricted to energies 
with clear and known label metrics and requires training. In contrast, our 
framework addresses this issue via a principled scheme that builds an energy 
pyramid with a decreasing number of labels without prior training and with 
fewer assumptions on the labels interactions. 



8.2 Multiscale Energy Pyramid 

In this work we consider discrete pair- wise minimization problems, defined 
over a (weighted) graph (V,£), of the form: 



E(L) 



^2 & (k 



+ 



(i,j)e£ 



Wij 



1.1) 



where V is the set of variables, £ is the set of edges, and the solution is 
discrete: L e {l,...,/} n , with n variables taking I possible labels. Many 



problems in computer vision are cast in the form of (8.1) (see Szeliski et 
al. (2008)). Furthermore, we do not restrict the energy to be submodular, 



and our framework is also applicable to more challenging non-submodular 
energies. 

Our aim is to build an energy pyramid with a decreasing number of 
degrees of freedom. The key component in constructing such a pyramid is 
the interpolation method. The interpolation maps solutions between levels of 
the pyramid, and defines how to approximate the original energy with fewer 
degrees of freedom. We propose a novel principled energy aware interpolation 
method. The resulting energy pyramid exposes the multiscale landscape of 
the energy making low energy assignments apparent at coarse levels. 
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However, it is counter intuitive to directly interpolate discrete values, 
since they usually have only semantic interpretation. Therefore, we substi- 
tute an assignment L by a binary matrix U G {0, l} nxl . The rows of U 
correspond to the variables, and the columns corresponds to labels: U^ a = 1 
iff variable i is labeled "a" (li = a). This representation allows us to inter- 
polate discrete solutions, as will be shown in the subsequent sections. 



Expressing the energy (8.1) using U yields a relaxed quadratic represen- 



tation (along the lines of Rangarajan ( 2000| )) that forms the basis for our 
multiscale framework derivation: 



E (U) = Tr (DU T + WUVU T ) 

i 

s.t. U e {0, l} nxl , Y,Ui* = l 



12) 
13) 



a=l 



where W is sparse with entries D G M nxZ s.t. Di ia =ipi(a), and V G 



R lxl s.t. Vctfi^ij) (a, a, /3 G {1, ...,/}. A detailed derivation of (8.2) can 
be found in Sec. 18. Al 

An energy over n variables with I labels is now parameterized by (n, /, D ) W, V). 

We first describe the energy pyramid construction for a given interpola- 
tion matrix P, and defer the detailed description of our novel interpolation 
to Sec. 18.31 



Energy coarsening by variables 

Let (n f J,D f ,W f , V) be the fine scale energy. We wish to generate a coarser 
representation (n c , /, D c ) W c , V) with n c < rJ . This representation approxi- 
mates E (Uf) using fewer variables: U c with only n c rows. 

Given an interpolation matrix P G [0, i] n/xn s .t. ^ . P^ = 1 Vi, it maps 
coarse to fine assignments through: 

U f « PU C (8.4) 

For any fine assignment that can be approximated by a coarse assignment 
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U c we may plug ( |8.4[ ) into (8.2) yielding: 

E(U f ) = Tr(p f U fT + W f U f VU fT ^ 

« Tr (D f U cT P T + W f PU c VU cT P T ) 

= Tr( (P T D f ) U cT + (P T W f P) U c VU cT ) 



=W C 



Tr (D c U cT + W c U c VU cT ) 
E(U C ) 



(8.5) 



We have generated a coarse energy E (U c ) parameterized by (n c , /, D c ) W c ) V) 
that approximates the fine energy E{U^). This coarse energy is of the same 
form as the original energy allowing us to apply the coarsening procedure 
recursively to construct an energy pyramid. 



Energy coarsening by labels 

So far we have explored the reduction of the number of degrees of freedom 
by reducing the number of variables. However, we may just as well look 
at the problem from a different perspective: reducing the search space by 
decreasing the number of labels from If to l c (l c < If). It is a well known fact 
that optimization algorithms (especially large move making, e.g., Boykov et 
al. ( 2001[ )) suffer from significant degradation in performance as the number 
of labels increases (Bleyer et al. ( |2010| ) ) . Here we propose a novel principled 
and general framework for reducing the number of labels at each scale. 

Let ( n, V ^ W, V* ) be the fine scale energy. Looking at a different 



interpolation matrix Pg [0,1]' X ' , we may interpolate a coarse solution 
by Iff ~ U C P T . This time the interpolation matrix P acts on the labels, i.e., 
the columns of U. The coarse labeling matrix U c has the same number of 
rows (variables), but fewer columns (labels). We use □ notation to emphasize 
that the coarsening here affects the labels rather than the variables. 
Coarsening the labels yields: 



E (U d ) = Tr ( (D f P) U dT + WU d (P T V f P) U 



r r ± 



(8.6) 



Again, we end up with the same type of energy, but this time it is defined 
over a smaller number of discrete labels: (n, / c , D c , W, V c ), where D C =D^ P 

and V £ =P T VfP. 
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.7 .3" 
1 

1 

.2 .8. 

Fig. 8.2: Interpolation as soft variable aggregation: fine variable 1, 2 and 
4 are aggregated into coarse variable 1, while fine variables 1,3 and 4 
are aggregated into coarse variable 2. Soft aggregation allows for fine 
variables to be influenced by few coarse variables, e.g.: fine variable 1 
is a convex combination of .7 of 1 and .3 of 2. Hard aggregation is a 
special case where P is a binary matrix. In that case each fine variable 
is influenced by exactly one coarse variable. 




The main theoretical contribution of this work is encapsulated in the 
multiscale "trick" of equations (8.5) and (8.6). This formulation forms the 
basis of our unified framework allowing us to coarsen the energy directly and 
exploits its multiscale landscape for efficient exploration of the solution space. 
This scheme moves the multiscale completely to the optimization side and 
makes it independent of any specific application. We can practically now 
approach a wide and diverse family of energies using the same multiscale 
implementation. 

The effectiveness of the multiscale approximation of (8.5) and (8.6) heav- 
ily depends on the interpolation matrix P (P resp.). Poorly constructed in- 
terpolation matrices will fail to expose the multiscale landscape of the func- 
tional. In the subsequent section we describe our principled energy-aware 
method for computing it. 



8.3 Energy-aware Interpolation 

In this section we use terms and notations for variable coarsening (P), how- 
ever the motivation and methods are applicable for label coarsening (P) as 



well due to the similar algebraic structure of (8.5) and (8.6). 

Our energy pyramid approximates the original energy using a decreasing 
number of degrees of freedom, thus excluding some solutions from the original 
search space at coarser scales. Which solutions are excluded is determined by 
the interpolation matrix P. A desired interpolation does not exclude 
low energy assignments at coarse levels. 

The matrix P can be interpreted as an operator that aggregates fine- 



scale variables into coarse ones (Fig. 8.2). Aggregating fine variables i and j 
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into a coarser one excludes from the search space all assignments for which 
li 7^ lj. This aggregation is undesired if assigning i and j to different labels 
yields low energy. However, when variables i and j are strongly correlated 
by the energy (i.e., assignments with li = lj yield low energy), aggregating 
them together efficiently allows exploration of low energy assignments. A 
desired interpolation aggregates i and j when i and j are strongly 
correlated by the energy. 



8.3.1 Measuring energy-aware correlations 

We provide two correlations measures, one used in computing variable coars- 
ening (P) and the other used for label coarsening (P). 

Energy-aware correlations between variables: A reliable estimation 
for the correlations between the variables allows us to construct a desirable 
P that aggregates strongly correlated variables. A naive approach would 
assume that neighboring variables are correlated (this assumption underlies 
Felzenszwalb and Huttenlocher ( 2006| )). This assumption clearly does not 



hold in general and may lead to an undesired interpolation matrix P. |Kim et 



al. (2011) proposed several "closed form formulas" for energy- aware variable 
grouping. However, their formulas take into account either the unary term 
or the pair-wise term. Indeed it is difficult to decide which term dominates 
and how to fuse these two terms together. Therefore, there is no "closed 
form" method that successfully integrates both of them. 

As opposed to these "closed form" methods, we propose a novel empirical 
scheme for correlation estimation. Empirical estimation of the correlations 
naturally accounts for and integrates the influence of both the unary and 



the pair-wise terms. Moreover, our method, inspired by Ron et al. (2011); 



Livne and Brandt (2011), extends to all energies (8.2): submodular, non- 
submodular, metric V ) arbitrary V , arbitrary W : energies defined over reg- 
ular grids and arbitrary graphs. 

Variables i and j are correlated by the energy when ^ = lj yields relatively 
low energy value. To estimate these correlations we empirically generate 
several "locally" low energy assignments, and measure the label agreement 
between neighboring variables i and j. We use Iterated Conditional Modes 



(ICM) of Besag| ( |1986[ ) to obtain locally low energy assignments: Starting 



with a random assignment, ICM chooses, at each iteration and for each 
variable, the label yielding the largest decrease of the energy function, con- 
ditioned on the labels assigned to its neighbors. 

Performing t = 10 ICM iterations for K = 10 random initializations pro- 
vides K locally low energy assignments {L fc } fcl . Our empirical dissimilarity 
between i and j is given by dy = V l k^ l k ) and their correlation is given 
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by Cij exp ( — -^-J , with a oc maxV. 

It is interesting to note that strong correlation between variables i and 
j usually implies that the pair- wise term binding them together (tpij) is a 
smoothness-preserving type of relation. We assume that even for challenging 
energies with many contrast-enhancing pair- wise terms, there are still signifi- 
cant amount of smoothness-preserving terms to allow for effective coarsening. 
Energy-aware correlations between labels: Correlations between labels 
are easier to estimate, since this information is explicit in the matrix V that 
encodes the "cost" (i.e., dissimilarity) between two labels. Setting c a ^ oc 

(Va,p^ ? we g e t a "closed-form" expression for the correlations between 
labels. 



8.3.2 From correlations to interpolation 

Using our measure for the variable correlations, c^, we follow the Algebraic 
Multigrid (AMG) method of Brandt ( |1986[ ) to compute an interpolation ma- 



trix P that softly aggregates strongly correlated variables. 

We begin by selecting a set of coarse representative variables V c C V^, 
such that every variable in V^\V C is strongly correlated with V c . That is, 
every variable in is either in V c or is strongly correlated to other variables 
in V c . A variable i is considered strongly correlated to V c if S jG y c Qj > 
(3^2 - e yf c ij- ft affects the coarsening rate, i.e., the ratio n c /n^ smaller f3 
results in a lower ratio. We perform this selection greedily and sequentially, 
starting with V c = adding i to V c if it is not yet strongly correlated to V c . 

Given the selected coarse variables V c , maps indices of variables from 
fine to coarse: is the coarse index of the variable whose fine index is j 



(in Fig. |8.2[ 1(2) = 1 and 1(3) = 2). The interpolation matrix P is defined 
by: 

f i e V f \V c , j e V c 
Pu(j) = \ 1 ieV c ,j = i (8.7) 
[ otherwise 

We further prune rows of P leaving only 5 maximal entries. Each row is 
then normalized to sum to 1. Throughout our experiments we use (3 — 0.2 
($ = 0.75), 5 = 3 (5 = 2) for computing P (P resp.). 



8.4 Unified Discrete Multiscale Framework 



So far we have described the different components of our multiscale frame- 
work. Alg. [3] puts them together into a multiscale minimization scheme. 
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Algorithm 3: Discrete multiscale optimization. 



Input: Energy (V°, D°, W°, V) . 
Output: U° 

Init s <— 0// fine scale 

// Energy pyramid construction: 

while I VI > 10 do 



Estimate pair- wise correlations at sc ale s (Sec. 8.3.1) 



Compute interpolation matrix P s (Sec. 8.3.2). 
Derive coarse energy (V s+1 , D S+1 ,W S+1 \V) (Eq. 8.5). 
s + + 

// Coarse-to-fine optimization: 
while s > do 

U s <- Refine (U s ) 

fja-i = psjjs// interpolate a solution 

s 

where Refine (U s ) uses an "off-the-shelf algorithm to optimize the 
energy (V s , D s , W s , V) with II s as an initialization. 



Given an energy (V, D,W, V), our framework first works fine-to-coarse to 
compute interpolation matrices {P s } that generates an "energy pyramid". 
Typically we end up at the coarsest scale with less than 10 variables. As a 
result, exploring the energy at this scale is robust to the initial assignment 
of the single-scale method usec^J 

Starting from the coarsest scale, a coarse solution at scale s is interpolated 
to a finer scale 5 — 1. At the finer scale it serves as a good initialization 
for an "off-the-shelf single-scale optimization that refines this interpolated 
solution. These two steps are repeated for all scales from coarse to fine. 

The inte rpol ated solution {7 s-1 , at each scale, might not satisfy the binary 
constraints (8.3). We round each row of U s ~ l by setting the maximal element 
to 1 and the rest to 0. 

The most computationally intensive modules of our framework are the 
empirical estimation of the variable correlations and the single-scale opti- 
mization used to refine the interpolated solutions. The complexity of the 
correlation estimation is O (\£ \ • /), where \£\ is the number of non-zero ele- 
ments in W and I is the number of labels. However, it is fairly straightforward 
to parallelize this module. 



2 In practice we use "winner-take-all" initialization as suggested by (Szeliski et al. t 2008 
§3.1). 
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It is now easy to see how our framework generalizes 



Huttenlocher (2006), Komodakis (2010) and Kim et al. (2011): They are 



restricted to hard aggregation in P. Felzenszwalb and Huttenlocher (2006) 



'elzenszwalb and 



and Komodakis| fl2010| ) use a multiscale pyramid, however their variable ag- 
gregation is not energy-aware, and is restricted to diadic pyramids. On the 
other hand, Kim et al. (2011) have limited energy-aware aggregation, applied 
to a two level only "pyramid" . They only optimize at the coarse scale and 
cannot refine the solution on the fine scale. 



8.5 Experimental Results 



Our experiments has two main goals: first, to stress the difficulty of approxi- 
mating non-submodular energies and to show the advantages of primal meth- 
ods for this type of minimization problems. The other goal is to demonstrate 
how our unified multiscale framework improved the performance of existing 
single-scale primal methods. 

We evaluated our multiscale framework on a diversity of discrete opti- 
mization task^j ranging from challenging non-submodular synthetic and 
co-clustering energies to low-level submodular vision energies such as denois- 
ing and stereo. In addition we provide a comparison between the different 



methods for measuring variable correlations that were presented in Sec. 8.3.1 



We conclude with a label coarsening experiment. In all of these experiments 
we minimize a given publicly available benchmark energy, we do not attempt 
to improve on the energy formulation itself. 

We use ICM (Besag ( 1986|)), a /3-swap and a-expansion (large move mak- 
ing algorithms of Boykov et al] (2001)) as representative single-scale "off- 
the-shelf primal optimization algorithms. To help large move making al- 
gorithms to overcome the non-submodularity of some of these energies we 



augment them with QPBO(I) of|Rother et aZ.|Q2007p . 

We follow the protoc ol of |Szeliski et al. (2008) that uses the lower bound 
of TRW-S ([Kolmogorov (2006)) as a baseline for comparing the performance 
of different optimization methods for different energies. We report how close 
the results (in percents) to the lower bound: closer to 100% is better. 

We show a remarkable improvement for ICM combined in our multiscale 
framework compared with a single-scale scheme. For the large move making 
algorithms there is a smaller but consistent improvement of the multiscale 
over a single scale scheme. TRW-S is a dual method and is considered state- 



of-the-art for discrete energy minimization Szeliski et al. (2008). However, we 



show that when it comes to non-submodular energies it struggles behind the 



3 code available at 



www. wisdom. we izmann. ac . il/~bagon/mat lab. html 
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IC 
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M 
single 
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Ours 


ap 
single 
scale 


Exp 
Ours 


and 
single 
scale 


TRW-S 


5 

10 
15 


112.6% 
123.6% 
127.1% 


115.9% 
130.2% 
135.8% 


108.9% 
118.5% 
122.1% 


110.0% 
120.2% 
124.1% 


110.5% 
121.5% 
124.6% 


110.0% 
121.0% 
125.1% 


116.6% 
134.6% 
138.3% 



Tab. 8.1: Synthetic results (energy): Showing percent of achieved energy value 
relative to the lower bound (closer to 100% is better) for ICM, a/3-swap, 
a-expansion and TRW-S for varying strengths of the pair-wise term (X = 
5, . . . , 15, stronger — >> harder to optimize.) 



large move making algorithms and even ICM. For these challenging energies, 
multiscale gives a significant boost in optimization performance. 

8.5.1 Synthetic 

We begin with synthetic non-submodular energies defined over a 4-connected 
grid graph of size 50 x 50 (n = 2500), and I = 5 labels. The unary term 
D ~ N (0,1). The pair-wise term V a p = Vp a ~ W(0,1) (V aa = 0) and 
Wij = Wji ~ A • U (— 1, 1). The parameter A controls the relative strength of 
the pair- wise term, stronger (i.e., larger A) results with energies more difficult 
to optimize (see Kolmogorov ( |2006D ) . Table |8j] shows results, averaged over 



100 experiments. 

The resulting synthetic energies are non-submodular (since may be- 
come negative). For these challenging energies, state-of-the-art dual method 
(TRW-S) performs rather poorljj^] (worse than single scale ICM) and there 
is a significant gap between the lower bound and the energy of the actual 
primal solution provided. Among the primal methods used, These results 
motivate our focusing on primal methods, especially a/5-swap. 

8.5.2 Chinese character inpainting 



We further experiment with learned binary energies of ( |Nowozin et al. , 2011 
§5. 2)0 These 100 instances of non-submodular pair-wise energies are de- 
fined over a 64-connected grid. These energies were designed and trained 
to perform the task of learning Chinese calligraphy, represented as a com- 
plex, non-local binary pattern. Despite the very large number of parameters 



4 We did not restrict the number of iterations, and let TRW-S run until no further 
improvement to the lower bound is made. 



5 available at www.nowozin.net/sebastian/papers/DTF_CIP_instances.zip 
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Fig. 8.3: Chinese characters inpainting: Visualizing some of the instances used 
in our experiments. Columns are (left to right): The original character 
used for testing. The input partially occluded character. ICM and QPBO 
results both our multiscale and single scale results. Results of TRW-S 



and results of Nowozin et al. (2011) obtained with a very long run of 
simulated annealing (using Gibbs sampling inside the annealing). 





ICM 


QPBO 


TRW-S 




Ours 


single-scale 


Ours 


single-scale 


(a) 


114.0% 


114.0% 


97.8% 


106.2% 


108.6% 


(b) 


7.0% 


7.0% 


77.0% 


34.0% 


25.0% 



Tab. 8.2: Energies of Chinese characters inpainting: table showing (a) mean 
energies for the inpainting experiment relative to baseline of \Nowozin et| 
al] (201l\ ) (I ower is better, less than 100% = lower than baseline), (b) 



percent of instances for which strictly lower energy was achieved. 
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Ours | single scale I Ours | single scale 
QPBO ICM 



TRW-S 



Fig. 8.4: Energies of Chinese characters inpainting: Box plot showing 25%, 
median and 75% of the resulting energies relative to reference energies 
of Nowozin et al. (2011]} (lower than 100% = lower than baseline). Our 
multiscale approach combined with QPBO achieves consistently better en- 
ergies than baseline, with very low variance. TRW-S improves on only 
25% of the instances with very high variance in the results. 



involved in representing such complex energies, learning is conducted very 
efficiently using Decision Tree Field (DTF). The main challenge in these 
models becomes the inference at test time. 

Our experiments show how approaching these challenging energies using 



our unified multiscale framework allows for better approximations. Table [872 
and Fig. |8.3| compare our multiscale framework to single-scale methods acting 
on the primal binary variables. Since the energies are binary, we use QPBO 
instead of large move making algorithms. We also provide an evaluation of a 
dual method (TRW-S) on these energies. In addition to the quantitative re- 



sults, Fig. |8.4| provides a visualization of some of the instances of the restored 
Chinese characters. 

These "real world" energies highlight the advantage primal methods has 
over dual ones when it comes to challenging non-submodular energies. It is 
further clear that significant improvement is made by our multiscale frame- 
work. 

8. 5. 3 Co-clustering 
The problem of co-clustering addresses the matching of superpixels within 



and across frames in a video sequence. Following (Bagon and Galun, 2011 



56.2), we treat co-clustering as a discrete minimization of non-submodular 



Potts energy. We obtained 77 co-clustering energies, courtesy of Glasner 



et al. (2011), used in their experiments. The number of variables in each 



energy ranges from 87 to 788. Their sparsity (percent of non-zero entries 
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71.8% 
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0.5% 



Tab. 8.3: Co-clustering results: 



Baseline for comparison are state-of-the-art 
results of Glasner et al. (2011). (a) We report our results as percent 
of the baseline: smaller is better, lower than 100% even outperforms 
state-of-the-art. (b) We also report the fraction of energies for which 
our multiscale framework outperform state-of-the-art. 



in W) ranges from 6% to 50%, The resulting energies are non-submodular, 
have no underlying regular grid, and are very challenging to optimize |Bagon| 
and Galun|([20lT1). 



Table 8J3 compares our discrete multiscale framework combined with ICM 
and a/3-swap. For these energies we use a different baseline: the state-of-the- 
art results of Glasner et al.\ ( |2011 ) obtained by applying specially tailored 
convex relaxation method (We do not use the lower bound of TRW-S here 
since it is far from being tight for these challenging energies) . Our multiscale 
framework improves state-of-the-art for this family of challenging energies. 



8.5.4 semi-metric energies 

We further applied our multiscale framework to optimize less challenging 
semi-metric energies. We use the diverse low-level vision MRF energies from 
the Middlebury benchmark Szeliski et al. ( 2008| |^ 



For these semi-metric energies, TRW-S (single scale) performs quite well 
and in fact, if enough iterations are allowed, its lower bound converges to 
the global optimum. As opposed to TRW-S, large move making and ICM 
do not always converge to the global optimum. Yet, we are able to show a 
significant improvement for primal optimization algorithms when used within 



our multiscale framework. Tables [874] and [875] and Figs. |8.5] and |8.6| show our 
multiscale results for the different submodular energies. 

One of the conclusions of the Middlebury challenge was that ICM is no 
longer a valid candidate for optimization. Integrating ICM into our multiscale 
framework puts it back on the right track. 



Available at |vi si on. middlebury . edu/MRF/j 
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102.7% 
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102.5% 


234.3% 
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100.3% 


100.5% 



Tab. 8.4: Stereo: Showing percent of achieved energy value relative to the lower 
bound (closer to 100% is better). Visual results for these experiments 
are in Fig. |g.5f Energies from \Szeliski et aL \2008 ). 
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Fig. 8.5: Stereo: 7Vo£e /iou> owr multiscale framework drastically improves ICM 
results, visible improvement for a/3-swap can also be seen in the middle 
row (Venus). Numerical results for these examples are shown in Ta- 
ble\£l\ 
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Tab. 8.5: Denoising and inpainting: Showing percent of achieved energy value 
relative to the lower bound (closer to 100% is better). Visual results for 
these experiments are in Fig. \8.6\ Energies from Szeliski et al. (2008). 
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Fig. 8.6: Denoising and inpainting: Single scale ICM is unable to cope with 
inpainting: performing local steps it is unable to propagate information 
far enough to fill the missing regions in the images. On the other hand, 
our multiscale framework allows ICM to perform large steps at coarse 
scales and successfully fill the gaps. Numerical results for these examples 
are shown in Table\ 



8.5.5 Comparing variable correlation estimation methods 



As explained in Sec. |8.3| the correlations between the variables are the most 
crucial component in constructing an effective multiscale scheme. In this 



experiment we compare our energy-aware correlation measure (Sec. 8.3.1) to 



three methods proposed by Kim et al. (2011): "unary-difP , "min-unary-diff" 
and "mean-compat" . These methods estimate the correlations based either 
on the unary term or the pair-wise term, but not both. We also compare 
to an energy-agnostic measure, that is = 1 Vi, j, this method underlies 



Felzenszwalb and Huttenlocher (2006). We use ICM within our framework 



to evaluate the influence these methods have on the resulting multiscale 
performance for four representative energies. 

Fig. |8.7| shows percent of lower bound for the different energies. Our 
measure consistently outperforms all other methods, and successfully bal- 
ances between the influence of the unary and the pair- wise terms. 



8. 5. 6 Coarsening labels 

a/3-swap does not scale gracefully with the number of labels. Coarsening 
an energy in the labels domain (i.e., same number of variables, fewer la- 
bels) proves to significantly improve performance of a/3-swap, as shown in 
Table |8.6[ For these examples constructing the energy pyramid took only 
milliseconds, due to the "closed form" formula for estimating label correla- 
tions. 
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150 
140 




Fig. 8.7: Comparing correlation measures: Graphs showing percent of lower 
bound (closer to 100% is better) for different methods of computing 
variable- correlations. Some of the bars are cropped at ~ 150%. Our 
energy-aware measure consistently outperforms all other methods. 



Our principled framework for coarsening labels improves a/3-swap perfor- 
mance for these energies. 

8.6 Conclusion 

This work presents a unified multiscale framework for discrete energy min- 
imization that allows for efficient and direct exploration of the multiscale 
landscape of the energy. We propose two paths to expose the multiscale 



Energy 


^labels 
(finest) 


^labels 
(coarsest) 


Ours 


single 
scale 


Penguin 
(denoising) 


256 


67 


103.6% 
128 [sec] 


111.3% 

253 [sec] 


Venus 
(stereo) 


20 


4 


106.0% 
100 [sec] 


128.7% 
130 [sec] 



Tab. 8.6: Coarsening labels (o:/3-swap): Working coarse-to-fine in the labels 
domain. We use 5 scales with coarsening rate of ~ 0.7. Number of 
variables is unchanged. Table shows percent of achieved energy value 
relative to the lower bound (closer to 100% is better), and running times. 
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landscape of the energy: one in which coarser scales involve fewer and coarser 
variables, and another in which the coarser levels involve fewer labels. We 
also propose adaptive methods for energy-aware interpolation between the 
scales. Our multiscale framework significantly improves optimization results 
for challenging energies. 

Our framework provides the mathematical formulation that "bridges the 
gap" and relates multiscale discrete optimization and algebraic multiscale 
methods used in PDE solvers (e.g., Brandt (1986)). This connection allows 
for methods and practices developed for numerical solvers to be applied in 
multiscale discrete optimization as well. 



Appendix 



8. A Derivation of eq. (8.2) 



In this work we consider discrete pair-wise minimization problems of the 
form: 

e (L) = E ie y A (h) + E (i , j)e £ w a • V> (U, h) 

Using the following parameterizations: D £ M. nxl s.t. D i)a =ipi(a), and 
1x1 s.t. V aj p=if) (a, f3) we claim that 
E (U) = Tr (DU T + 

s.t. u e {0,i} ny! 



V e 



.1) is equivalent to: 



(8.2 



WUVU T ) 

- , , , £U U ia = 1 ^ (8.3) 

Assuming both V and W are symmetric]^} Looking at the different 
ponents in (8.2): 



com- 



aj 



a (3 



af3 



^2i/j(a,f3)U ia U j(3 



7 if they are not we need to be slightly more careful with transposing them, but roughly 
similar expression can be derived. 
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Looking at the trace of the second term: 

Tr(WUVU T ) = ^[WUVlF].. 



As for the unary term: 

Tr(DU T ) = ^[DU T l 



i a 

^2^2^Pi (a) U ia 

i a 

(8-9) 



Putting ( |8.9[ ) and ( |8.8[ ) together we get: 

Tr (DU T + WUVU T ) = Tr (DU T ) + Tr (WUVU T ) 

i ij 

Note that the diagonal of W is assumed to be zero: this is a reasonable 
assumption as wu represents an interaction of variable i with itself. This type 
of interaction is well represented by the unary term D^. When coarsening 
the energy it may happen that W c will no longer have zeros on the diagonal. 
This case may arise when a single coarse variable represents neighboring fine 
scale variables. In that case the fine scale pair- wise interaction should be 
absorbed into the coarse scale unary term. It is easy to see that the term 
WiiVaa should be added to the unary term D ia , whenever W has non zeros on 
the diagonal. After this rectification, the non-zeros entries on the diagonal 
of W can be set to zero. 



8.B More General Energy Functions 
This chapter d8l) has focused on the construction of an energy pyramid for 



energies of the form (8.1). However, this form does not cover all possible 
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pair- wise energies. We have used this slightly restricted form throughout 
the paper since it emphasizes the simplicity of the algebraic derivation of 
our multiscale framework. Nevertheless, our multiscale framework can be as 
easily applied to more general energies. 



8.B.1 Pair-wise 



(V, £) can be written 
(8.10) 



A general from for pair- wise energy over a graph Q 
as 

E(L) = ^2<Pi(li) + Wi^h) 

In this more general form, the pair-wise term can be entirely different 
for each pair: y?y (kjj). This is in contrast to the energy 
pair-wise terms differ only by the scaling factor Wij of a single fixed term 
(p(k,lj). The Photomontage energies of (Szeliski et a/., 2008, 



1|) where the 
fixed term 
4.2) are an 
example of such general pair- wise energies. 

Instead of using a pair of matrices V and W to parameterize the pair- 
wise terms, we use a collection of matrices {^ J } ijG £ for the same purpose 
in the general settings. Each matrix V 1 ^ is of size I x I and is defined as 

A general energy is now parameterized by ^n, /, D, {V l i} i ^g S j . 
Coarsening variables: The computation of the interpolation matrix P is 



unchanged. ICM can be applied to energies of the form (8.10) to estimate 



agreement between neighboring variables. These agreements are then used 



to compute the interpolation matrix P (Sec. 8.3). 



In order to write the coarsening of the pair-wise term we need to intro- 
duce some notations: Let z, j be variables at the fine scale, and /, J denote 
variables in the coarse scale. An entry, Pn : in the interpolation matrix indi- 
cates how fine scale variable i is affected by a coarse scale variable /. Given 
an interpolation matrix P we can coarsen the energy by 



D c 



a/3 



= P T D f 



111) 



ije 



The coarse graph, £ c , is defined by all the non-zero pair- wise terms V IJ . The 
coarser energy is now parameterized by 
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Coarsening labels: In this case, the computation of the interpolation ma- 
trix P is not trivial, since we no longer have a single matrix V to derive the 
agreements from. We leave this issue of deriving an efficient interpolation 
matrix for the general case for future work. However, given a matrix P the 
coarsening of labels can be done easily: 

yzjc def pTyijfp y { j & £ (8.12) 

yielding a coarser energy parameterized by 

(n,,,D*,{v»*} m£ ) 
with the same number of variables, n, but fewer labels l c < P . 



It is fairly straight forward to see that the energy (8.1) is a special case 
of the more general form (8.10) and so the coarsening of variables and labels 
of Eq. (8.5) and (8.6) can be seen as special cases of Eq. (8.11) and (8.12) 
resp. 



8.B.2 High-order energies 

Discrete energies may involve terms that are beyond pair-wise: that is, de- 
scribe interaction between sets of variables. These energies are often referred 
to as high-order energies. Examples of such energies can be found in e.g., 



Kohli et al. 


(2009 


); 


Rother et al. 


(2009) 



A high order energy is defined over a hyper- graph (V, S) where the hyper- 
edges are subsets of variables s G S s.t. s C V. 

E{L) = Y, <Pi (h) + Y,<Ps ({k \ies}) (8.13) 

Where the high-order terms cp s are |s|-way discrete functions: 

<p a : {l,...,/} 1 * 1 (8.14) 

A high-order energy of the form ( |8. 13| ) can be parameterized using tensors. 
Each high-order term, y? s , is parameterized by a |<s|-order tensor 

K**,/3,...,c=Vs = OLj j = P,...J m = t\s = {i,j,...,m}) (8.15) 



A high-order energy (8.13) is now parameterized by (n, /, D, {V s } se< s)- 
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Coarsening variables: The computation of the interpolation matrix P is 



unchanged. ICM can be applied to energies of the form (8.13) to estimate 



agreement between neighboring variables. These agreements are then used 



to compute the interpolation matrix P (Sec. 8.3). 



Given an interpolation matrix P we can coarsen the energy by 

D C def pTDf 

yS={I,J,...,M} def p ^ .p. j. y{s=i,j,...,m} y s £ gf (8.16) 

For all coarse variables /, J, . . . , M with non-zero entry P^j, Pjj, .... These 
non-zeros interactions in P defines the coarse scale hyper-edges in S c . Note 
that when two (or more) fine-scale variable z, j, . . . are represented by the 
same coarse variable /, then the size of S (the coarse scale hyper-edge) is 
reduced relative to the size of s (the fine scale hyper-edge). 
Coarsening labels: In this case, the computation of the interpolation ma- 
trix P is not trivial, since we no longer have a clear representation of the 
interactions between the different labels. We leave this issue of deriving an 
efficient interpolation matrix for the general case for future work. However, 
given a matrix P the coarsening of labels can be done easily. Using a,/3 to 
denote coarse scale labels 

D c def D fp 

v*u... = E H.,... A« • A/3 • . . . (8.17) 

ot,P,... 

yielding a coarser energy parameterized by 

(n,l c ,DMV^} seS ) 
with the same number of variables, n, but fewer labels l c <P ' . 
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In this work I motivated the use of challenging discrete energies, beyond 
semi-metric, for different tasks in computer vision. I conclude that despite 
the computational challenges posed by stepping away form the well stud- 
ied smoothness-encouraging energies, the gain in terms of expressiveness and 
quality of solution is worth while. For some applications these energies de- 
scribe better the underlying process and therefore better suited for modeling. 
Moreover, practical algorithms and approaches were proposed in this work 
to cope with the resulting challenging optimization tasks. 

Considering the task of unsupervised clustering as an example, using the 
arbitrary Potts energy of the Correlation Clustering functional allows not 
only to model the partition of the data into different clusters, but also to 
recover the underlying number of clusters. This recovery of the number of 
cluster is an issue usually evaded by many clustering algorithms that assume 
this information is somehow given to them as an input. This thesis showed 
how the challenging optimization of the CC functional lies at the core of 
several diverse applications, such as: sketching the common, unsupervised 
face identification and interactive segmentation. 

Focusing on the CC optimization example, this work also proposed sev- 
eral practical approximation algorithms. The CC functional was casted as 
a special case of an arbitrary Potts energy minimization. This discrete for- 
mulation of the functional allows to adapt and utilize known methods for 
discrete optimization resulting with CC optimization algorithms capable of 
handling large number of variables (two orders of magnitude more than ex- 
isting methods). 

Another example of challenging arbitrary energy is the task of 3D shape 
reconstruction. The problem of 3D reconstruction from two images with 
common lighting conditions and known object motion with respect to static 
camera is discretized to form a challenging energy. This energy emerges from 
the induced integrality constraints on the recovered surface. It is defined over 
a very large label space (a few thousand of labels). Despite the challenging 
pair-wise terms and the large label space successful approximation scheme 
was proposed and adapted for this task. 

When it comes to learning the parameters of discrete energies for various 
tasks in computer vision, it so happens that the underlying data may be 
better explained by contrast enhancing terms, as in the Chinese characters 
and the body parts. In fact when the learning procedure is not restricted 
it may prefer contrast-enhancing functions as better depictions for the un- 
derlying statistical behavior of the training set. Therefore, restricting the 
learned model to only submodular or semi-metric energies might severely 
compromise its ability to generalize to novel exemplars. 

Coping with the optimization of the resulting challenging energies is a 
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difficult task. Yet, this work proposes several methods that provide good 
empirical approximations for a variety of practical energies. Our unified dis- 
crete multiscale framework succeeds in providing good approximations, in 
practice, for a variety of challenging energies. Our multiscale framework ex- 
ploits the underlying multiscale landscape of the given energy and exposes it 
to allow for an efficient and effective coarse-to-fine optimization process. For- 
mulating the coarsening procedure in a clear algebraic formulation allows us 
to propose two paths to expose the multiscale landscape of the energy: one in 
which coarser scales involve fewer and coarser variables, and another in which 
the coarser levels involve fewer labels. We also propose adaptive methods for 
energy- aware interpolation between the scales. This approach exploits the 
underlying multiscale landscape of the given energy and exposes it to allow for 
an efficient and effective coarse-to-fine optimization process. Our framework 
provides the mathematical formulation that "bridges the gap" and relates 
multiscale discrete optimization and algebraic multiscale methods used in 
PDE solvers (e.g., Brandt (1986)). This connection allows for methods and 
practices developed for numerical solvers to be applied in multiscale discrete 
optimization as well. 
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